Laszlo Ersek
2022-Feb-02 08:23 UTC
[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption
On 02/01/22 20:44, Eric Blake wrote:> These are the trivial patches modeled after Nir's commit 7ba6ef67, > fixing further places that fail to check *error during callbacks.I find these interfaces very difficult to use. https://libguestfs.org/nbd_aio_block_status.3.html We have two callbacks here, each takes a *error, and I have zero idea of the order the callbacks are supposed to be called in. The last paragraph of the commit message speaks about this, and the code updates seem to suggest some ordering, but I'm still unsure. ... In fact, I'm now uncertain whether ignoring "err" in virt-v2v's "lib/utils.ml", function "get_disk_allocated", is safe. It's a "sync" (not aio) nbd_block_status() call, but the documentation <https://libguestfs.org/nbd_block_status.3.html> says: "On entry to the callback, the error parameter contains the errno value of any previously detected error." So what should we do then, in "get_disk_allocated"? Return -1 immediately? And if we do so, what is that going to mean for "NBD.block_status"? Will it raise an exception? Thanks, Laszlo> nbdcopy still has a bigger bug in that it is NOT properly checking for > *error in the pread/pwrite completion handlers, but fixing that > gracefully (rather than just quitting the process immediately, as we > can do in the examples), requires more invasive patches which I will > be posting separately for review, while the ones posted here are > trivial enough that I'm pushing them now (commits 23bcea4a..1373d423). > > I also audited other uses of completion callbacks: > - in tests, closure-lifetimes, errors, server-death, and > shutdown-flags are all safe (either the callback returns 0 and the > code later calls nbd_aio_command_completed, or we are explicitly > testing aspects of completion callbacks) > - in fuse/operations.c, the completion callback returns 0 and calls > nbd_aio_command_completed later > - in examples/strict-structured-reads.c, the completion callback > properly checks *error > > I also checked that nbdkit's nbd plugin properly checks *error. > > Eric Blake (4): > examples: aio-connect-read.c: Fix error handling > examples: glib-main-loop: don't strand commands when gsource > disappears > examples: glib-main-loop: Fix error handling > copy: Ignore extents if we encountered earlier error > > examples/aio-connect-read.c | 7 +++++++ > examples/glib-main-loop.c | 14 ++++++++++++-- > copy/nbd-ops.c | 4 ++-- > 3 files changed, 21 insertions(+), 4 deletions(-) >
Laszlo Ersek
2022-Feb-02 08:59 UTC
[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption
On 02/02/22 09:23, Laszlo Ersek wrote:> On 02/01/22 20:44, Eric Blake wrote: >> These are the trivial patches modeled after Nir's commit 7ba6ef67, >> fixing further places that fail to check *error during callbacks. > > I find these interfaces very difficult to use. > > https://libguestfs.org/nbd_aio_block_status.3.html > > We have two callbacks here, each takes a *error, and I have zero idea of > the order the callbacks are supposed to be called in. > > The last paragraph of the commit message speaks about this, and the code > updates seem to suggest some ordering, but I'm still unsure. > > ... In fact, I'm now uncertain whether ignoring "err" in virt-v2v's > "lib/utils.ml", function "get_disk_allocated", is safe. It's a "sync" > (not aio) nbd_block_status() call, but the documentation > <https://libguestfs.org/nbd_block_status.3.html> says: > > "On entry to the callback, the error parameter contains the errno value > of any previously detected error." > > So what should we do then, in "get_disk_allocated"? Return -1 > immediately?>From patch#4, it seems that we need exactly this. Here's the function:let get_disk_allocated ~dir ~disknr let socket = sprintf "%s/out%d" dir disknr and alloc_ctx = "base:allocation" in with_nbd_connect_unix ~socket ~meta_contexts:[alloc_ctx] (fun nbd -> if NBD.can_meta_context nbd alloc_ctx then ( (* Get the list of extents, using a 2GiB chunk size as hint. *) let size = NBD.get_size nbd and allocated = ref 0_L and fetch_offset = ref 0_L in while !fetch_offset < size do let remaining = size -^ !fetch_offset in let fetch_size = min 0x8000_0000_L remaining in NBD.block_status nbd fetch_size !fetch_offset (fun ctx offset entries err -> assert (ctx = alloc_ctx); for i = 0 to Array.length entries / 2 - 1 do let len = entries.(i * 2) and typ = entries.(i * 2 + 1) in assert (len > 0_L); if typ &^ 1_L = 0_L then allocated := !allocated +^ len; fetch_offset := !fetch_offset +^ len done; 0 ) done; Some !allocated ) else None ) If our (nameless) function is entered with a nonzero "err", then we cannot trust the "entries" array. Which means not only "allocated" may be miscalculated, but even "fetch_offset" -- and the latter can as well break the chunking (= outer) loop. This seems to imply we *must not* continue the outer loop. So how do we handle this in OCaml? If we return -1 from the (nameless) extents callback, will NBD.block_status raise an exception? Hmmmm: maybe. In the generated nbd_internal_ocaml_nbd_block_status() function, in file "ocaml/nbd-c.c", we have: caml_enter_blocking_section (); r = nbd_block_status (h, count, offset, extent_callback, flags); caml_leave_blocking_section (); if (r == -1) nbd_internal_ocaml_raise_error (); However, in extent_wrapper() -> extent_wrapper_locked(), there is more confusion for me: metacontextv = caml_copy_string (metacontext); offsetv = caml_copy_int64 (offset); entriesv = nbd_internal_ocaml_alloc_int64_from_uint32_array (entries, nr_entries); errorv = caml_alloc_tuple (1); Store_field (errorv, 0, Val_int (*error)); args[0] = metacontextv; args[1] = offsetv; args[2] = entriesv; args[3] = errorv; rv = caml_callbackN_exn (data->fnv, 4, args); *error = Int_val (Field (errorv, 0)); if (Is_exception_result (rv)) { nbd_internal_ocaml_exception_in_wrapper ("extent", rv); CAMLreturnT (int, -1); } r = Int_val (rv); assert (r >= 0); CAMLreturnT (int, r); If I understand correctly, if we returned (-1) from the (nameless) extents callback, that would trip the assert(); would it not? Thus, we would not reach the nbd_internal_ocaml_raise_error() in the above-quoted nbd_internal_ocaml_nbd_block_status() function. Thanks, Laszlo> And if we do so, what is that going to mean for > "NBD.block_status"? Will it raise an exception? > > Thanks, > Laszlo > >> nbdcopy still has a bigger bug in that it is NOT properly checking for >> *error in the pread/pwrite completion handlers, but fixing that >> gracefully (rather than just quitting the process immediately, as we >> can do in the examples), requires more invasive patches which I will >> be posting separately for review, while the ones posted here are >> trivial enough that I'm pushing them now (commits 23bcea4a..1373d423). >> >> I also audited other uses of completion callbacks: >> - in tests, closure-lifetimes, errors, server-death, and >> shutdown-flags are all safe (either the callback returns 0 and the >> code later calls nbd_aio_command_completed, or we are explicitly >> testing aspects of completion callbacks) >> - in fuse/operations.c, the completion callback returns 0 and calls >> nbd_aio_command_completed later >> - in examples/strict-structured-reads.c, the completion callback >> properly checks *error >> >> I also checked that nbdkit's nbd plugin properly checks *error. >> >> Eric Blake (4): >> examples: aio-connect-read.c: Fix error handling >> examples: glib-main-loop: don't strand commands when gsource >> disappears >> examples: glib-main-loop: Fix error handling >> copy: Ignore extents if we encountered earlier error >> >> examples/aio-connect-read.c | 7 +++++++ >> examples/glib-main-loop.c | 14 ++++++++++++-- >> copy/nbd-ops.c | 4 ++-- >> 3 files changed, 21 insertions(+), 4 deletions(-) >> >
Richard W.M. Jones
2022-Feb-02 10:43 UTC
[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption
On Wed, Feb 02, 2022 at 09:23:09AM +0100, Laszlo Ersek wrote:> On 02/01/22 20:44, Eric Blake wrote: > > These are the trivial patches modeled after Nir's commit 7ba6ef67, > > fixing further places that fail to check *error during callbacks. > > I find these interfaces very difficult to use. > > https://libguestfs.org/nbd_aio_block_status.3.html > > We have two callbacks here, each takes a *error, and I have zero idea of > the order the callbacks are supposed to be called in.The documentation is not great, but I can tell you that they are called in the order: - extent_cb (usually once, but in theory several times) - completion_cb (once at the end) The *error parameter is also not well documented, but the behaviour is: - There is a single error value stored for the asynch block status command (cmd->error). - *error is initialized from cmd->error on entry to the callback, ie. *error != 0 if an error has been reported by a previous callback. - If the callback sets *error != 0 _and_ returns -1 then cmd->error is updated. - The callback can also fail by returning -1 and not setting *error, in which case cmd->error is set to EPROTO (but don't do this as it means your program is buggy - but see below about OCaml). - Since there's only one error value, updating it overwrites previous values. - Returning -1 doesn't stop future callbacks, because basically we're pulling data off the wire and there's no way to tell the server to stop. So extent and completion callbacks are still called. - Don't forget that its not just callbacks that can fail, commands can fail for other reasons, eg. network errors. In this case the whole connection will drop into what we call the %DEAD state (because NBD can't resynchronize). Every call returns -1 and you have to close and reopen the connection. It's possible to explicitly test for this (nbd_aio_is_dead). https://gitlab.com/nbdkit/libnbd/-/blob/1373d423f5ac42255fe606d2a7ab1caf18a51ded/generator/states-reply-structured.c#L455 https://gitlab.com/nbdkit/libnbd/-/blob/1373d423f5ac42255fe606d2a7ab1caf18a51ded/generator/states-reply.c#L151 To find out if a command had an error you can do either (but not both): - Check for *error != 0 in completion_cb. Return 1 from completion_cb which auto-retires the command. - Call nbd_aio_command_completed which will return -1 and then call nbd_get_errno which will return cmd->error. This also retires the command. (Note in this case you also have to deal with nbd_aio_command_completed returning 0, meaning the command has not completed, or 1, meaning the command completed without error). The first method is what nbdcopy should be doing but isn't.> The last paragraph of the commit message speaks about this, and the code > updates seem to suggest some ordering, but I'm still unsure. > > ... In fact, I'm now uncertain whether ignoring "err" in virt-v2v's > "lib/utils.ml", function "get_disk_allocated", is safe. It's a "sync" > (not aio) nbd_block_status() call, but the documentation > <https://libguestfs.org/nbd_block_status.3.html> says: > > "On entry to the callback, the error parameter contains the errno value > of any previously detected error."The above applies to synch commands as well, with the difference that: - There's no completion callback (because there's no need for it in a synch call). - Once the command completes, it is retired and the error status is copied to nbd_get_errno. - The synch command returns -1 if there was an error or 0 if successful. See lib/rw.c: wait_for_command and nbd_unlocked_block_status. So in answer to your question, virt-v2v can ignore !err, unless there is a reason why you want to know if previous callbacks encountered an error (maybe you want to save some work). You'll still get an exception thrown by NBD.block_status.> So what should we do then, in "get_disk_allocated"? Return -1 > immediately? And if we do so, what is that going to mean for > "NBD.block_status"? Will it raise an exception?All OCaml wrappers check for the underlying nbd_* call failing and turn that into an exception which carries the errno and error string. See ocaml/nbd-c.c and ocaml/helpers.c:nbd_internal_ocaml_raise_error In addition, when calling back into OCaml code (ie. into the extent callback), if the callback throws an exception then that is automatically turned into setting *error from the err reference and returning -1. OCaml code ought to do this when encountering an error inside the callback: if some_error then ( err := errno; raise AnException ) But there are a number of difficulties with that in practice: - OCaml functions that you call might raise exceptions which you don't catch, so you don't end up setting !err on all exceptions. - OCaml code doesn't have access to C errno values! I wouldn't worry too much about this. What happens in practice is that nbd_internal_ocaml_exception_in_wrapper prints this on stderr: libnbd: <operation>: uncaught OCaml exception: <exception name> cmd->error is set to EPROTO and the function as a whole throws some NBD.Error exception. The error doesn't get lost in this case, it's just that the output is not ideal. See ocaml/nbd-c.c:extent_wrapper_locked and ocaml/helpers.c:nbd_internal_ocaml_exception_in_wrapper Rich.> Thanks, > Laszlo > > > nbdcopy still has a bigger bug in that it is NOT properly checking for > > *error in the pread/pwrite completion handlers, but fixing that > > gracefully (rather than just quitting the process immediately, as we > > can do in the examples), requires more invasive patches which I will > > be posting separately for review, while the ones posted here are > > trivial enough that I'm pushing them now (commits 23bcea4a..1373d423). > > > > I also audited other uses of completion callbacks: > > - in tests, closure-lifetimes, errors, server-death, and > > shutdown-flags are all safe (either the callback returns 0 and the > > code later calls nbd_aio_command_completed, or we are explicitly > > testing aspects of completion callbacks) > > - in fuse/operations.c, the completion callback returns 0 and calls > > nbd_aio_command_completed later > > - in examples/strict-structured-reads.c, the completion callback > > properly checks *error > > > > I also checked that nbdkit's nbd plugin properly checks *error. > > > > Eric Blake (4): > > examples: aio-connect-read.c: Fix error handling > > examples: glib-main-loop: don't strand commands when gsource > > disappears > > examples: glib-main-loop: Fix error handling > > copy: Ignore extents if we encountered earlier error > > > > examples/aio-connect-read.c | 7 +++++++ > > examples/glib-main-loop.c | 14 ++++++++++++-- > > copy/nbd-ops.c | 4 ++-- > > 3 files changed, 21 insertions(+), 4 deletions(-) > >-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW