Nikolaus Rath
2022-May-15 15:45 UTC
[Libguestfs] Communication issues between NBD driver and NBDKit server
Hi, I am observing some strange errors when using the Kernel's NBD driver with NBDkit. On the kernel side, I see: May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0 to 104857600 May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0 to 104857600 May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request 00000000ae5feee7: control (write at 4836316160,32768B). Runtime 30 seconds May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request 000000007094eddc: control (write at 5372947456,10240B). Runtime 30 seconds May 15 16:18:27 vostro.rath.org kernel: block nbd0: Suspicious reply 89 (status 0 flags 0) May 15 16:18:31 vostro.rath.org kernel: block nbd0: Possible stuck request 0000000075f8b9bc: control (write at 8057764864,32768B). Runtime 30 seconds May 15 16:18:41 vostro.rath.org kernel: block nbd0: Possible stuck request 000000002d1b3e8b: control (write at 14499979264,32768B). Runtime 30 seconds [...] And userspace ('zfs snapshot" in this instance) is stuck afterwards. On the NBDkit side, there seemingly are write errors when replying back to the kernel: $ nbdkit --unix /tmp/tmpi5o59_y_/nbd_socket_sb --foreground --filter=exitlast --filter=stats --threads 16 S3 size=50G bucket=nikratio-backup key=sb statsfile=/tmp/tmpi5o59_y_/stats_sb.txt object-size=32K & $ nbd-client -unix /tmp/tmpi5o59_y_/nbd_socket_sb /dev/nbd2 Warning: the oldstyle protocol is no longer supported. This method now uses the newstyle protocol with a default export Negotiation: ..size = 51200MB Connected /dev/nbd0 [....] nbdkit: python.10: error: write reply: NBD_CMD_WRITE: Broken pipe What's the best way to narrow down who's the culprit here (kernel vs NBD server)? Best, -Nikolaus -- GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F ?Time flies like an arrow, fruit flies like a Banana.? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://listman.redhat.com/archives/libguestfs/attachments/20220515/0680727a/attachment.htm>
Josef Bacik
2022-May-15 15:51 UTC
[Libguestfs] Communication issues between NBD driver and NBDKit server
On Sun, May 15, 2022 at 11:45 AM Nikolaus Rath <nikolaus at rath.org> wrote:> > Hi, > > I am observing some strange errors when using the Kernel's NBD driver with NBDkit. > > On the kernel side, I see: > > May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0 to 104857600 > May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0 to 104857600 > May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request 00000000ae5feee7: control (write at 4836316160,32768B). Runtime 30 seconds > May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request 000000007094eddc: control (write at 5372947456,10240B). Runtime 30 secondsThe server isn't responding to the request fast enough, and you don't have a timeout set so it'll just hang until you disconnect.> May 15 16:18:27 vostro.rath.org kernel: block nbd0: Suspicious reply 89 (status 0 flags 0)The server double replied to a request. Thanks, Josef
Nikolaus Rath
2022-May-15 16:39 UTC
[Libguestfs] Communication issues between NBD driver and NBDKit server
On May 15 2022, Josef Bacik <josef at toxicpanda.com> wrote:> On Sun, May 15, 2022 at 11:45 AM Nikolaus Rath <nikolaus at rath.org> wrote: >> >> Hi, >> >> I am observing some strange errors when using the Kernel's NBD driver with NBDkit. >> >> On the kernel side, I see: >> >> May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0 to 104857600 >> May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0 to 104857600 >> May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request >> 00000000ae5feee7: control (write at 4836316160,32768B). Runtime 30 seconds >> May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request >> 000000007094eddc: control (write at 5372947456,10240B). Runtime 30 seconds > > The server isn't responding to the request fast enough, and you don't > have a timeout set so it'll just hang until you disconnect.Are you saying it will continue to hang even after the request has been responded to? Because NBDkit claims to have sent a response (I'm not sure if within 30 seconds, but definitely at some point). Best, -Nikolaus -- GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F ?Time flies like an arrow, fruit flies like a Banana.?
Richard W.M. Jones
2022-May-15 18:05 UTC
[Libguestfs] Communication issues between NBD driver and NBDKit server
On Sun, May 15, 2022 at 04:45:11PM +0100, Nikolaus Rath wrote:> Hi, > > I am observing some strange errors when using the Kernel's NBD driver with > NBDkit. > > On the kernel side, I see: > > May 15 16:16:11 vostro.rath.org kernel: nbd0: detected capacity change from 0 > to 104857600 > May 15 16:16:11 vostro.rath.org kernel: nbd1: detected capacity change from 0 > to 104857600 > May 15 16:18:23 vostro.rath.org kernel: block nbd0: Possible stuck request > 00000000ae5feee7: control (write at 4836316160,32768B). Runtime 30 seconds > May 15 16:18:25 vostro.rath.org kernel: block nbd0: Possible stuck request > 000000007094eddc: control (write at 5372947456,10240B). Runtime 30 seconds > May 15 16:18:27 vostro.rath.org kernel: block nbd0: Suspicious reply 89 (status > 0 flags 0) > May 15 16:18:31 vostro.rath.org kernel: block nbd0: Possible stuck request > 0000000075f8b9bc: control (write at 8057764864,32768B). Runtime 30 seconds > May 15 16:18:41 vostro.rath.org kernel: block nbd0: Possible stuck request > 000000002d1b3e8b: control (write at 14499979264,32768B). Runtime 30 seconds > [...]Does it really take over 30 seconds for nbdkit to respond? You might want to insert some debugging into the S3 plugin to see what stage of the request cycle is taking so long, although I'm going to guess it's the remote Amazon server itself. It seems like you can adjust this timeout using the nbd-client -t flag (it calls ioctl(NBD_SET_TIMEOUT) in the kernel). If I understand the logic correctly, the nbd timeout is currently set to 0, which causes the default socket timeout to be used. Using the -t flag overrides this. So I guess try setting it larger and see if the problem goes away. Rich.> And userspace ('zfs snapshot" in this instance) is stuck afterwards. > > On the NBDkit side, there seemingly are write errors when replying back to the > kernel: > > $ nbdkit --unix /tmp/tmpi5o59_y_/nbd_socket_sb --foreground --filter=exitlast > --filter=stats --threads 16 S3 size=50G bucket=nikratio-backup key=sb statsfile > =/tmp/tmpi5o59_y_/stats_sb.txt object-size=32K & > $ nbd-client -unix /tmp/tmpi5o59_y_/nbd_socket_sb /dev/nbd2 > Warning: the oldstyle protocol is no longer supported. > This method now uses the newstyle protocol with a default export > Negotiation: ..size = 51200MB > Connected /dev/nbd0 > [....] > nbdkit: python.10: error: write reply: NBD_CMD_WRITE: Broken pipe > > > What's the best way to narrow down who's the culprit here (kernel vs NBD > server)? > > Best, > -Nikolaus > > -- > GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F > > ?Time flies like an arrow, fruit flies like a Banana.? > >> _______________________________________________ > Libguestfs mailing list > Libguestfs at redhat.com > https://listman.redhat.com/mailman/listinfo/libguestfs-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org