thr3ads.net - Libguestfs - [Libguestfs] [libnbd PATCH v3 06/22] states: Break deadlock if server goofs on extended replies [Jul 2023]

If this information is useful, please help other people find it:
Share via:

Laszlo Ersek

2023-Jun-01 13:42 UTC

[Libguestfs] [libnbd PATCH v3 06/22] states: Break deadlock if server goofs on extended replies

On 5/25/23 15:00, Eric Blake wrote:> One of the benefits of extended replies is that we can do a
> fixed-length read for the entire header of every server reply, which
> is fewer syscalls than the split-read approach required by structured
> replies.
(Totally tangential comment: recvmsg() could scatter the incoming
traffic into non-contiguous fields of a non-packed structure. But I
don't know if TLS has anything similar. The current "linear
receive"
approach is probably the least demanding of the underlying socket
abstractions. At the same time it requires us to use packed structs,
and/or multiple syscalls.)
> But one of the drawbacks of doing a large read is that if
> the server is non-compliant (not a problem for normal servers, but
> something I hit rather more than I'd like to admit while developing
> extended header support in servers),
Haha, so this is where it's coming from! :) I can totally relate.
> nbd_pwrite() and friends will
> deadlock if the server replies with the wrong header.  Add in some
> code to catch that failure mode and move the state machine to DEAD
> sooner, to make it easier to diagnose the fault in the server.
> 
> Unlike in the case of an unexpected simply reply from a structured
(1) s/simply/simple/
> server (where we never over-read, and can therefore
(2) At this point my English parser gets thrown off:
> commit b31e7bac
> can merely fail the command with EPROTO and successfully move on to
> the next server reply),
resync here:
> in this case we really do have to move to
> DEAD: in addition to having already read the 16 or 20 bytes that the
> server sent in its (short) reply for this command, we may have already
> read the initial bytes of the server's next reply, but we have no way
> to push those extra bytes back onto our read stream for parsing on our
> next pass through the state machine.
> 
> Signed-off-by: Eric Blake <eblake at redhat.com>
> ---
>  generator/states-reply.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/generator/states-reply.c b/generator/states-reply.c
> index 4e9f2dde..d4710d91 100644
> --- a/generator/states-reply.c
> +++ b/generator/states-reply.c
> @@ -109,7 +109,28 @@  REPLY.START:
>   REPLY.RECV_REPLY:
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return 0;
> -  case 1: SET_NEXT_STATE (%.READY); return 0;
> +  case 1: SET_NEXT_STATE (%.READY);
> +    /* Special case: if we have a short read, but got at least far
> +     * enough to decode the magic number, we can check if the server
> +     * is matching our expectations. This lets us avoid deadlocking if
> +     * a buggy server sends only 16 bytes of a simple reply, and is
> +     * waiting for our next command, while we are blocked waiting for
> +     * the server to send 32 bytes of an extended reply.
> +     */
(4) Slight inconsistency between commit message and code comment: the
former mentions "20 bytes", but the latter doesn't mention
"structured
reply".
> +    if (h->extended_headers &&
> +        (char *)h->rbuf >= (char
*)&h->sbuf.reply.hdr.extended.flags) {
(.) I wonder if (address-of-magic + size-of magic) might be more
readable / greppable. Just in case.

Feel free to ignore.
> +      uint32_t magic = be32toh (h->sbuf.reply.hdr.extended.magic);
> +      if (magic != NBD_EXTENDED_REPLY_MAGIC) {
> +        SET_NEXT_STATE (%.DEAD); /* We've probably lost
synchronization. */
> +        set_error (0, "invalid or unexpected reply magic 0x%"
PRIx32, magic);
> +#if 0 /* uncomment to see desynchronized data */
> +        nbd_internal_hexdump (&h->sbuf.reply.hdr.extended.flags,
> +                              sizeof
(h->sbuf.reply.hdr.extended.flags),
> +                              stderr);
> +#endif
> +      }
> +    }
> +    return 0;
(5) This could be factored out to a helper function, to share the
"invalid:" label logic with the previous patch.

(6) Commencing a dump from "flags" onward looks questionable. At this
point, the "flags" field need not to be complete, or even started --
all
we know is that the "magic" field *before" "flags" is
complete.

I think we should dump "h->sbuf.reply.hdr.simple", like in patch#5.

(.) Side comment (so no bullet number assigned): because we can take
multiple iterations on RECV_REPLY, we may end up checking the "magic"
field multiple times. Not very costly, just something to be aware of.

(7) Assume that we have a short read first, but then complete
REPLY.RECV_REPLY successfully, and move to
REPLY.CHECK_SIMPLE_OR_STRUCTURED_REPLY.

Then, the following condition will have been caught (excluded) under
RECV_REPLY:

  (extended_headers && magic != NBD_EXTENDED_REPLY_MAGIC)     [1]

Consequently, the same condition will never hold in
REPLY.CHECK_SIMPLE_OR_STRUCTURED_REPLY.

Now consider REPLY.CHECK_SIMPLE_OR_STRUCTURED_REPLY, from the previous
patch:

   magic = be32toh (h->sbuf.reply.hdr.simple.magic);
   switch (magic) {
   case NBD_SIMPLE_REPLY_MAGIC:
     if (h->extended_headers)
       goto invalid;

This branch becomes dead code. The condition

  h->extended_headers && magic == NBD_SIMPLE_REPLY_MAGIC

would satisfy [1], therefore it can't happen here.

Further:

     SET_NEXT_STATE (%SIMPLE_REPLY.START);
     break;
   case NBD_STRUCTURED_REPLY_MAGIC:
   case NBD_EXTENDED_REPLY_MAGIC:
     if ((magic == NBD_STRUCTURED_REPLY_MAGIC) == h->extended_headers)
       goto invalid;

While this lovely condition is satisfied by "true == true", that is,
by:

  h->extended_headers && magic == NBD_STRUCTURED_REPLY_MAGIC    [2]

the same condition [2] satisfies [1] too, and therefore it can never occur.

The only way for satisfying this

  (magic == NBD_STRUCTURED_REPLY_MAGIC) == h->extended_headers

condition would be the "false == false" case, that is:

  !h->extended_headers && magic == NBD_EXTENDED_REPLY_MAGIC    [3]

(note that (magic != NBD_STRUCTURED_REPLY_MAGIC) is equivalent to (magic
== NBD_EXTENDED_REPLY_MAGIC) due to how the switch statement is structured,)

and then we should use that expression, i.e., [3], because it is more
readable than the double-equality (i.e., equivalence) form.

*HOWEVER*! :) If we do *not* see some short-but-together-long-enough
reads first, but see a full read at once, then our "pre-check" in
RECV_REPLY does not get activated at all! And then all the conditions in
CHECK_SIMPLE_OR_STRUCTURED_REPLY remain necessary.

So this is my long-winded way to ask that:
>    case 0: SET_NEXT_STATE (%CHECK_SIMPLE_OR_STRUCTURED_REPLY);
>    }
>    return 0;
you please consider adding a comment here, saying that, while the short
read special case *does* short-circuit a number of checks under
CHECK_SIMPLE_OR_STRUCTURED_REPLY, all the checks in
CHECK_SIMPLE_OR_STRUCTURED_REPLY remain required if under RECV_REPLY we
get a full read immediately.

Thanks!
Laszlo

Eric Blake

2023-Jul-21 12:50 UTC

head link

[Libguestfs] [libnbd PATCH v3 06/22] states: Break deadlock if server goofs on extended replies

On Thu, Jun 01, 2023 at 03:42:30PM +0200, Laszlo Ersek
wrote:> On 5/25/23 15:00, Eric Blake wrote:
> > One of the benefits of extended replies is that we can do a
> > fixed-length read for the entire header of every server reply, which
> > is fewer syscalls than the split-read approach required by structured
> > replies.
> 
> (Totally tangential comment: recvmsg() could scatter the incoming
> traffic into non-contiguous fields of a non-packed structure. But I
> don't know if TLS has anything similar. The current "linear
receive"
> approach is probably the least demanding of the underlying socket
> abstractions. At the same time it requires us to use packed structs,
> and/or multiple syscalls.)
> 
> > But one of the drawbacks of doing a large read is that if
> > the server is non-compliant (not a problem for normal servers, but
> > something I hit rather more than I'd like to admit while
developing
> > extended header support in servers),
> 
> Haha, so this is where it's coming from! :) I can totally relate.
> 
> > nbd_pwrite() and friends will
> > deadlock if the server replies with the wrong header.  Add in some
> > code to catch that failure mode and move the state machine to DEAD
> > sooner, to make it easier to diagnose the fault in the server.
> > 
> > Unlike in the case of an unexpected simply reply from a structured
> 
> (1) s/simply/simple/
Yep.
> 
> > server (where we never over-read, and can therefore
> 
> (2) At this point my English parser gets thrown off:
> 
> > commit b31e7bac
> > can merely fail the command with EPROTO and successfully move on to
> > the next server reply),
Fixing the parenthetical as follows:

(where, since commit b31e7bac, we can merely fail the command with
EPROTO and successfully move on to the next server reply, because we
didn't read too many bytes yet)
> 
> resync here:
> 
> > in this case we really do have to move to
> > DEAD: in addition to having already read the 16 or 20 bytes that the
> > server sent in its (short) reply for this command, we may have already
> > read the initial bytes of the server's next reply, but we have no
way
> > to push those extra bytes back onto our read stream for parsing on our
> > next pass through the state machine.
> > 
> > Signed-off-by: Eric Blake <eblake at redhat.com>
> > ---
> >  generator/states-reply.c | 23 ++++++++++++++++++++++-
> >  1 file changed, 22 insertions(+), 1 deletion(-)
> > 
> > diff --git a/generator/states-reply.c b/generator/states-reply.c
> > index 4e9f2dde..d4710d91 100644
> > --- a/generator/states-reply.c
> > +++ b/generator/states-reply.c
> > @@ -109,7 +109,28 @@  REPLY.START:
> >   REPLY.RECV_REPLY:
> >    switch (recv_into_rbuf (h)) {
> >    case -1: SET_NEXT_STATE (%.DEAD); return 0;
> > -  case 1: SET_NEXT_STATE (%.READY); return 0;
> > +  case 1: SET_NEXT_STATE (%.READY);
> > +    /* Special case: if we have a short read, but got at least far
> > +     * enough to decode the magic number, we can check if the server
> > +     * is matching our expectations. This lets us avoid deadlocking
if
> > +     * a buggy server sends only 16 bytes of a simple reply, and is
> > +     * waiting for our next command, while we are blocked waiting for
> > +     * the server to send 32 bytes of an extended reply.
> > +     */
> 
> (4) Slight inconsistency between commit message and code comment: the
> former mentions "20 bytes", but the latter doesn't mention
"structured
> reply".
Did I miss (3)?  But yes, I can improve this comment.
> 
> > +    if (h->extended_headers &&
> > +        (char *)h->rbuf >= (char
*)&h->sbuf.reply.hdr.extended.flags) {
> 
> (.) I wonder if (address-of-magic + size-of magic) might be more
> readable / greppable. Just in case.
> 
> Feel free to ignore.
Actually, I agree that it is nicer (although a bit longer).
> 
> > +      uint32_t magic = be32toh (h->sbuf.reply.hdr.extended.magic);
> > +      if (magic != NBD_EXTENDED_REPLY_MAGIC) {
> > +        SET_NEXT_STATE (%.DEAD); /* We've probably lost
synchronization. */
> > +        set_error (0, "invalid or unexpected reply magic
0x%" PRIx32, magic);
> > +#if 0 /* uncomment to see desynchronized data */
> > +        nbd_internal_hexdump
(&h->sbuf.reply.hdr.extended.flags,
> > +                              sizeof
(h->sbuf.reply.hdr.extended.flags),
> > +                              stderr);
> > +#endif
> > +      }
> > +    }
> > +    return 0;
> 
> (5) This could be factored out to a helper function, to share the
> "invalid:" label logic with the previous patch.
> 
> (6) Commencing a dump from "flags" onward looks questionable. At
this
> point, the "flags" field need not to be complete, or even started
-- all
> we know is that the "magic" field *before" "flags"
is complete.
> 
> I think we should dump "h->sbuf.reply.hdr.simple", like in
patch#5.
Yep, I noticed that while addressing your point (4), just before
reading your point (6).  Bug dumping the full hdr.simple is also too
much; really, given the preconditions above, all we can dump is
hdr.magic.  At which point, the hexdump is not providing us any
further information than what the set_error() call already output.
I'm just deleting the #if 0 segment, which in turn eliminates the need
to address point (5), as the two invalid: labels are no longer
identical.
> 
> (.) Side comment (so no bullet number assigned): because we can take
> multiple iterations on RECV_REPLY, we may end up checking the
"magic"
> field multiple times. Not very costly, just something to be aware of.
Indeed.  In the common case, short reads are rare, and even when they
do happen, it is even more rare to hit it more than once per packet.
I wonder if nc or similar has a way to force a server's response to be
flushed after every byte, to see the performance impact of maximum
network overhead.

But your observation means you are correctly aware of a larger design
aspect of the state machine: all code handling short reads (case 1 of
recv_into_rbuf) must be re-startable (no messing with h->rbuf, for
example).  Stateful changes can only occur when the read is complete
(case 0) or irretrievably failed (case -1).
> 
> (7) Assume that we have a short read first, but then complete
> REPLY.RECV_REPLY successfully, and move to
> REPLY.CHECK_SIMPLE_OR_STRUCTURED_REPLY.
...> 
> *HOWEVER*! :) If we do *not* see some short-but-together-long-enough
> reads first, but see a full read at once, then our "pre-check" in
> RECV_REPLY does not get activated at all! And then all the conditions in
> CHECK_SIMPLE_OR_STRUCTURED_REPLY remain necessary.
> 
> So this is my long-winded way to ask that:
> 
> >    case 0: SET_NEXT_STATE (%CHECK_SIMPLE_OR_STRUCTURED_REPLY);
> >    }
> >    return 0;
> 
> you please consider adding a comment here, saying that, while the short
> read special case *does* short-circuit a number of checks under
> CHECK_SIMPLE_OR_STRUCTURED_REPLY, all the checks in
> CHECK_SIMPLE_OR_STRUCTURED_REPLY remain required if under RECV_REPLY we
> get a full read immediately.
Good idea.  Added for v4.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org

Libguestfs - Jul 2023 - [libnbd PATCH v3 06/22] states: Break deadlock if server goofs on extended replies

[Libguestfs] [libnbd PATCH v3 06/22] states: Break deadlock if server goofs on extended replies

[Libguestfs] [libnbd PATCH v3 06/22] states: Break deadlock if server goofs on extended replies