thr3ads.net - Libguestfs - [Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption [Feb 2022]

If this information is useful, please help other people find it:
Share via:

Laszlo Ersek

2022-Feb-02 08:23 UTC

[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption

On 02/01/22 20:44, Eric Blake wrote:> These are the trivial patches modeled after Nir's commit 7ba6ef67,
> fixing further places that fail to check *error during callbacks.
I find these interfaces very difficult to use.

https://libguestfs.org/nbd_aio_block_status.3.html

We have two callbacks here, each takes a *error, and I have zero idea of
the order the callbacks are supposed to be called in.

The last paragraph of the commit message speaks about this, and the code
updates seem to suggest some ordering, but I'm still unsure.

... In fact, I'm now uncertain whether ignoring "err" in
virt-v2v's
"lib/utils.ml", function "get_disk_allocated", is safe.
It's a "sync"
(not aio) nbd_block_status() call, but the documentation
<https://libguestfs.org/nbd_block_status.3.html> says:

"On entry to the callback, the error parameter contains the errno value
of any previously detected error."

So what should we do then, in "get_disk_allocated"? Return -1
immediately? And if we do so, what is that going to mean for
"NBD.block_status"? Will it raise an exception?

Thanks,
Laszlo
> nbdcopy still has a bigger bug in that it is NOT properly checking for
> *error in the pread/pwrite completion handlers, but fixing that
> gracefully (rather than just quitting the process immediately, as we
> can do in the examples), requires more invasive patches which I will
> be posting separately for review, while the ones posted here are
> trivial enough that I'm pushing them now (commits 23bcea4a..1373d423).
> 
> I also audited other uses of completion callbacks:
> - in tests, closure-lifetimes, errors, server-death, and
>   shutdown-flags are all safe (either the callback returns 0 and the
>   code later calls nbd_aio_command_completed, or we are explicitly
>   testing aspects of completion callbacks)
> - in fuse/operations.c, the completion callback returns 0 and calls
>   nbd_aio_command_completed later
> - in examples/strict-structured-reads.c, the completion callback
>   properly checks *error
> 
> I also checked that nbdkit's nbd plugin properly checks *error.
> 
> Eric Blake (4):
>   examples: aio-connect-read.c: Fix error handling
>   examples: glib-main-loop: don't strand commands when gsource
>     disappears
>   examples: glib-main-loop: Fix error handling
>   copy: Ignore extents if we encountered earlier error
> 
>  examples/aio-connect-read.c |  7 +++++++
>  examples/glib-main-loop.c   | 14 ++++++++++++--
>  copy/nbd-ops.c              |  4 ++--
>  3 files changed, 21 insertions(+), 4 deletions(-)
>

Laszlo Ersek

2022-Feb-02 08:59 UTC

head link

[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption

On 02/02/22 09:23, Laszlo Ersek wrote:> On 02/01/22 20:44, Eric Blake wrote:
>> These are the trivial patches modeled after Nir's commit 7ba6ef67,
>> fixing further places that fail to check *error during callbacks.
>
> I find these interfaces very difficult to use.
>
> https://libguestfs.org/nbd_aio_block_status.3.html
>
> We have two callbacks here, each takes a *error, and I have zero idea of
> the order the callbacks are supposed to be called in.
>
> The last paragraph of the commit message speaks about this, and the code
> updates seem to suggest some ordering, but I'm still unsure.
>
> ... In fact, I'm now uncertain whether ignoring "err" in
virt-v2v's
> "lib/utils.ml", function "get_disk_allocated", is safe.
It's a "sync"
> (not aio) nbd_block_status() call, but the documentation
> <https://libguestfs.org/nbd_block_status.3.html> says:
>
> "On entry to the callback, the error parameter contains the errno
value
> of any previously detected error."
>
> So what should we do then, in "get_disk_allocated"? Return -1
> immediately?
>From patch#4, it seems that we need exactly this. Here's the function:
let get_disk_allocated ~dir ~disknr   let socket = sprintf "%s/out%d"
dir disknr
  and alloc_ctx = "base:allocation" in
  with_nbd_connect_unix ~socket ~meta_contexts:[alloc_ctx]
    (fun nbd ->
         if NBD.can_meta_context nbd alloc_ctx then (
           (* Get the list of extents, using a 2GiB chunk size as hint. *)
           let size = NBD.get_size nbd
           and allocated = ref 0_L
           and fetch_offset = ref 0_L in
           while !fetch_offset < size do
             let remaining = size -^ !fetch_offset in
             let fetch_size = min 0x8000_0000_L remaining in
             NBD.block_status nbd fetch_size !fetch_offset
               (fun ctx offset entries err ->
                  assert (ctx = alloc_ctx);
                  for i = 0 to Array.length entries / 2 - 1 do
                    let len = entries.(i * 2)
                    and typ = entries.(i * 2 + 1) in
                    assert (len > 0_L);
                    if typ &^ 1_L = 0_L then
                      allocated := !allocated +^ len;
                    fetch_offset := !fetch_offset +^ len
                  done;
                  0
               )
           done;
           Some !allocated
         ) else None
       )

If our (nameless) function is entered with a nonzero "err", then we
cannot trust the "entries" array. Which means not only
"allocated" may
be miscalculated, but even "fetch_offset" -- and the latter can as
well
break the chunking (= outer) loop. This seems to imply we *must not*
continue the outer loop.

So how do we handle this in OCaml? If we return -1 from the (nameless)
extents callback, will NBD.block_status raise an exception?

Hmmmm: maybe. In the generated nbd_internal_ocaml_nbd_block_status()
function, in file "ocaml/nbd-c.c", we have:

  caml_enter_blocking_section ();
  r =  nbd_block_status (h, count, offset, extent_callback, flags);
  caml_leave_blocking_section ();

  if (r == -1)
    nbd_internal_ocaml_raise_error ();

However, in extent_wrapper() -> extent_wrapper_locked(), there is more
confusion for me:

  metacontextv = caml_copy_string (metacontext);
  offsetv = caml_copy_int64 (offset);
  entriesv = nbd_internal_ocaml_alloc_int64_from_uint32_array (entries,
nr_entries);
  errorv = caml_alloc_tuple (1);
  Store_field (errorv, 0, Val_int (*error));
  args[0] = metacontextv;
  args[1] = offsetv;
  args[2] = entriesv;
  args[3] = errorv;
  rv = caml_callbackN_exn (data->fnv, 4, args);
  *error = Int_val (Field (errorv, 0));
  if (Is_exception_result (rv)) {
    nbd_internal_ocaml_exception_in_wrapper ("extent", rv);
    CAMLreturnT (int, -1);
  }

  r = Int_val (rv);
  assert (r >= 0);
  CAMLreturnT (int, r);

If I understand correctly, if we returned (-1) from the (nameless)
extents callback, that would trip the assert(); would it not? Thus, we
would not reach the nbd_internal_ocaml_raise_error() in the above-quoted
nbd_internal_ocaml_nbd_block_status() function.

Thanks,
Laszlo

> And if we do so, what is that going to mean for
> "NBD.block_status"? Will it raise an exception?
>
> Thanks,
> Laszlo
>
>> nbdcopy still has a bigger bug in that it is NOT properly checking for
>> *error in the pread/pwrite completion handlers, but fixing that
>> gracefully (rather than just quitting the process immediately, as we
>> can do in the examples), requires more invasive patches which I will
>> be posting separately for review, while the ones posted here are
>> trivial enough that I'm pushing them now (commits
23bcea4a..1373d423).
>>
>> I also audited other uses of completion callbacks:
>> - in tests, closure-lifetimes, errors, server-death, and
>>   shutdown-flags are all safe (either the callback returns 0 and the
>>   code later calls nbd_aio_command_completed, or we are explicitly
>>   testing aspects of completion callbacks)
>> - in fuse/operations.c, the completion callback returns 0 and calls
>>   nbd_aio_command_completed later
>> - in examples/strict-structured-reads.c, the completion callback
>>   properly checks *error
>>
>> I also checked that nbdkit's nbd plugin properly checks *error.
>>
>> Eric Blake (4):
>>   examples: aio-connect-read.c: Fix error handling
>>   examples: glib-main-loop: don't strand commands when gsource
>>     disappears
>>   examples: glib-main-loop: Fix error handling
>>   copy: Ignore extents if we encountered earlier error
>>
>>  examples/aio-connect-read.c |  7 +++++++
>>  examples/glib-main-loop.c   | 14 ++++++++++++--
>>  copy/nbd-ops.c              |  4 ++--
>>  3 files changed, 21 insertions(+), 4 deletions(-)
>>
>

Richard W.M. Jones

2022-Feb-02 10:43 UTC

head link

[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption

On Wed, Feb 02, 2022 at 09:23:09AM +0100, Laszlo Ersek
wrote:> On 02/01/22 20:44, Eric Blake wrote:
> > These are the trivial patches modeled after Nir's commit 7ba6ef67,
> > fixing further places that fail to check *error during callbacks.
> 
> I find these interfaces very difficult to use.
> 
> https://libguestfs.org/nbd_aio_block_status.3.html
> 
> We have two callbacks here, each takes a *error, and I have zero idea of
> the order the callbacks are supposed to be called in.
The documentation is not great, but I can tell you that they are
called in the order:

 - extent_cb (usually once, but in theory several times)
 - completion_cb (once at the end)

The *error parameter is also not well documented, but the behaviour is:

 - There is a single error value stored for the asynch block status
   command (cmd->error).

 - *error is initialized from cmd->error on entry to the callback,
   ie. *error != 0 if an error has been reported by a previous
   callback.

 - If the callback sets *error != 0 _and_ returns -1 then cmd->error
   is updated.

 - The callback can also fail by returning -1 and not setting *error,
   in which case cmd->error is set to EPROTO (but don't do this as
   it means your program is buggy - but see below about OCaml).

 - Since there's only one error value, updating it overwrites
   previous values.

 - Returning -1 doesn't stop future callbacks, because basically we're
   pulling data off the wire and there's no way to tell the server to
   stop.  So extent and completion callbacks are still called.

 - Don't forget that its not just callbacks that can fail, commands
   can fail for other reasons, eg. network errors.  In this case the
   whole connection will drop into what we call the %DEAD state
   (because NBD can't resynchronize).  Every call returns -1 and you
   have to close and reopen the connection.  It's possible to
   explicitly test for this (nbd_aio_is_dead).

https://gitlab.com/nbdkit/libnbd/-/blob/1373d423f5ac42255fe606d2a7ab1caf18a51ded/generator/states-reply-structured.c#L455
https://gitlab.com/nbdkit/libnbd/-/blob/1373d423f5ac42255fe606d2a7ab1caf18a51ded/generator/states-reply.c#L151

To find out if a command had an error you can do either (but not both):

 - Check for *error != 0 in completion_cb.  Return 1 from
   completion_cb which auto-retires the command.

 - Call nbd_aio_command_completed which will return -1 and
   then call nbd_get_errno which will return cmd->error.  This
   also retires the command.  (Note in this case you also have
   to deal with nbd_aio_command_completed returning 0, meaning
   the command has not completed, or 1, meaning the command
   completed without error).

The first method is what nbdcopy should be doing but isn't.
> The last paragraph of the commit message speaks about this, and the code
> updates seem to suggest some ordering, but I'm still unsure.
> 
> ... In fact, I'm now uncertain whether ignoring "err" in
virt-v2v's
> "lib/utils.ml", function "get_disk_allocated", is safe.
It's a "sync"
> (not aio) nbd_block_status() call, but the documentation
> <https://libguestfs.org/nbd_block_status.3.html> says:
> 
> "On entry to the callback, the error parameter contains the errno
value
> of any previously detected error."
The above applies to synch commands as well, with the difference that:

 - There's no completion callback (because there's no need for it in a
   synch call).

 - Once the command completes, it is retired and the error status is
   copied to nbd_get_errno.

 - The synch command returns -1 if there was an error or 0 if successful.

See lib/rw.c: wait_for_command and nbd_unlocked_block_status.

So in answer to your question, virt-v2v can ignore !err, unless there
is a reason why you want to know if previous callbacks encountered an
error (maybe you want to save some work).  You'll still get an
exception thrown by NBD.block_status.
> So what should we do then, in "get_disk_allocated"? Return -1
> immediately? And if we do so, what is that going to mean for
> "NBD.block_status"? Will it raise an exception?
All OCaml wrappers check for the underlying nbd_* call failing and
turn that into an exception which carries the errno and error string.
See ocaml/nbd-c.c and ocaml/helpers.c:nbd_internal_ocaml_raise_error

In addition, when calling back into OCaml code (ie. into the extent
callback), if the callback throws an exception then that is
automatically turned into setting *error from the err reference and
returning -1.

OCaml code ought to do this when encountering an error inside the
callback:

  if some_error then (
    err := errno;
    raise AnException
  )

But there are a number of difficulties with that in practice:

 - OCaml functions that you call might raise exceptions which you
   don't catch, so you don't end up setting !err on all exceptions.

 - OCaml code doesn't have access to C errno values!

I wouldn't worry too much about this.  What happens in practice is
that nbd_internal_ocaml_exception_in_wrapper prints this on stderr:

  libnbd: <operation>: uncaught OCaml exception: <exception name>

cmd->error is set to EPROTO and the function as a whole throws some
NBD.Error exception.  The error doesn't get lost in this case, it's
just that the output is not ideal.

See ocaml/nbd-c.c:extent_wrapper_locked and
ocaml/helpers.c:nbd_internal_ocaml_exception_in_wrapper

Rich.
> Thanks,
> Laszlo
> 
> > nbdcopy still has a bigger bug in that it is NOT properly checking for
> > *error in the pread/pwrite completion handlers, but fixing that
> > gracefully (rather than just quitting the process immediately, as we
> > can do in the examples), requires more invasive patches which I will
> > be posting separately for review, while the ones posted here are
> > trivial enough that I'm pushing them now (commits
23bcea4a..1373d423).
> > 
> > I also audited other uses of completion callbacks:
> > - in tests, closure-lifetimes, errors, server-death, and
> >   shutdown-flags are all safe (either the callback returns 0 and the
> >   code later calls nbd_aio_command_completed, or we are explicitly
> >   testing aspects of completion callbacks)
> > - in fuse/operations.c, the completion callback returns 0 and calls
> >   nbd_aio_command_completed later
> > - in examples/strict-structured-reads.c, the completion callback
> >   properly checks *error
> > 
> > I also checked that nbdkit's nbd plugin properly checks *error.
> > 
> > Eric Blake (4):
> >   examples: aio-connect-read.c: Fix error handling
> >   examples: glib-main-loop: don't strand commands when gsource
> >     disappears
> >   examples: glib-main-loop: Fix error handling
> >   copy: Ignore extents if we encountered earlier error
> > 
> >  examples/aio-connect-read.c |  7 +++++++
> >  examples/glib-main-loop.c   | 14 ++++++++++++--
> >  copy/nbd-ops.c              |  4 ++--
> >  3 files changed, 21 insertions(+), 4 deletions(-)
> > 
-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Libguestfs - Feb 2022 - [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption

[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption

[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption

[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption