Laszlo Ersek
2022-Feb-02 08:59 UTC
[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption
On 02/02/22 09:23, Laszlo Ersek wrote:> On 02/01/22 20:44, Eric Blake wrote: >> These are the trivial patches modeled after Nir's commit 7ba6ef67, >> fixing further places that fail to check *error during callbacks. > > I find these interfaces very difficult to use. > > https://libguestfs.org/nbd_aio_block_status.3.html > > We have two callbacks here, each takes a *error, and I have zero idea of > the order the callbacks are supposed to be called in. > > The last paragraph of the commit message speaks about this, and the code > updates seem to suggest some ordering, but I'm still unsure. > > ... In fact, I'm now uncertain whether ignoring "err" in virt-v2v's > "lib/utils.ml", function "get_disk_allocated", is safe. It's a "sync" > (not aio) nbd_block_status() call, but the documentation > <https://libguestfs.org/nbd_block_status.3.html> says: > > "On entry to the callback, the error parameter contains the errno value > of any previously detected error." > > So what should we do then, in "get_disk_allocated"? Return -1 > immediately?>From patch#4, it seems that we need exactly this. Here's the function:let get_disk_allocated ~dir ~disknr let socket = sprintf "%s/out%d" dir disknr and alloc_ctx = "base:allocation" in with_nbd_connect_unix ~socket ~meta_contexts:[alloc_ctx] (fun nbd -> if NBD.can_meta_context nbd alloc_ctx then ( (* Get the list of extents, using a 2GiB chunk size as hint. *) let size = NBD.get_size nbd and allocated = ref 0_L and fetch_offset = ref 0_L in while !fetch_offset < size do let remaining = size -^ !fetch_offset in let fetch_size = min 0x8000_0000_L remaining in NBD.block_status nbd fetch_size !fetch_offset (fun ctx offset entries err -> assert (ctx = alloc_ctx); for i = 0 to Array.length entries / 2 - 1 do let len = entries.(i * 2) and typ = entries.(i * 2 + 1) in assert (len > 0_L); if typ &^ 1_L = 0_L then allocated := !allocated +^ len; fetch_offset := !fetch_offset +^ len done; 0 ) done; Some !allocated ) else None ) If our (nameless) function is entered with a nonzero "err", then we cannot trust the "entries" array. Which means not only "allocated" may be miscalculated, but even "fetch_offset" -- and the latter can as well break the chunking (= outer) loop. This seems to imply we *must not* continue the outer loop. So how do we handle this in OCaml? If we return -1 from the (nameless) extents callback, will NBD.block_status raise an exception? Hmmmm: maybe. In the generated nbd_internal_ocaml_nbd_block_status() function, in file "ocaml/nbd-c.c", we have: caml_enter_blocking_section (); r = nbd_block_status (h, count, offset, extent_callback, flags); caml_leave_blocking_section (); if (r == -1) nbd_internal_ocaml_raise_error (); However, in extent_wrapper() -> extent_wrapper_locked(), there is more confusion for me: metacontextv = caml_copy_string (metacontext); offsetv = caml_copy_int64 (offset); entriesv = nbd_internal_ocaml_alloc_int64_from_uint32_array (entries, nr_entries); errorv = caml_alloc_tuple (1); Store_field (errorv, 0, Val_int (*error)); args[0] = metacontextv; args[1] = offsetv; args[2] = entriesv; args[3] = errorv; rv = caml_callbackN_exn (data->fnv, 4, args); *error = Int_val (Field (errorv, 0)); if (Is_exception_result (rv)) { nbd_internal_ocaml_exception_in_wrapper ("extent", rv); CAMLreturnT (int, -1); } r = Int_val (rv); assert (r >= 0); CAMLreturnT (int, r); If I understand correctly, if we returned (-1) from the (nameless) extents callback, that would trip the assert(); would it not? Thus, we would not reach the nbd_internal_ocaml_raise_error() in the above-quoted nbd_internal_ocaml_nbd_block_status() function. Thanks, Laszlo> And if we do so, what is that going to mean for > "NBD.block_status"? Will it raise an exception? > > Thanks, > Laszlo > >> nbdcopy still has a bigger bug in that it is NOT properly checking for >> *error in the pread/pwrite completion handlers, but fixing that >> gracefully (rather than just quitting the process immediately, as we >> can do in the examples), requires more invasive patches which I will >> be posting separately for review, while the ones posted here are >> trivial enough that I'm pushing them now (commits 23bcea4a..1373d423). >> >> I also audited other uses of completion callbacks: >> - in tests, closure-lifetimes, errors, server-death, and >> shutdown-flags are all safe (either the callback returns 0 and the >> code later calls nbd_aio_command_completed, or we are explicitly >> testing aspects of completion callbacks) >> - in fuse/operations.c, the completion callback returns 0 and calls >> nbd_aio_command_completed later >> - in examples/strict-structured-reads.c, the completion callback >> properly checks *error >> >> I also checked that nbdkit's nbd plugin properly checks *error. >> >> Eric Blake (4): >> examples: aio-connect-read.c: Fix error handling >> examples: glib-main-loop: don't strand commands when gsource >> disappears >> examples: glib-main-loop: Fix error handling >> copy: Ignore extents if we encountered earlier error >> >> examples/aio-connect-read.c | 7 +++++++ >> examples/glib-main-loop.c | 14 ++++++++++++-- >> copy/nbd-ops.c | 4 ++-- >> 3 files changed, 21 insertions(+), 4 deletions(-) >> >
Richard W.M. Jones
2022-Feb-02 10:55 UTC
[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption
On Wed, Feb 02, 2022 at 09:59:04AM +0100, Laszlo Ersek wrote:> However, in extent_wrapper() -> extent_wrapper_locked(), there is more > confusion for me: > > metacontextv = caml_copy_string (metacontext); > offsetv = caml_copy_int64 (offset); > entriesv = nbd_internal_ocaml_alloc_int64_from_uint32_array (entries, nr_entries); > errorv = caml_alloc_tuple (1); > Store_field (errorv, 0, Val_int (*error)); > args[0] = metacontextv; > args[1] = offsetv; > args[2] = entriesv; > args[3] = errorv; > rv = caml_callbackN_exn (data->fnv, 4, args); > *error = Int_val (Field (errorv, 0)); > if (Is_exception_result (rv)) { > nbd_internal_ocaml_exception_in_wrapper ("extent", rv); > CAMLreturnT (int, -1); > } > > r = Int_val (rv); > assert (r >= 0); > CAMLreturnT (int, r); > > If I understand correctly, if we returned (-1) from the (nameless) > extents callback, that would trip the assert(); would it not? Thus, we > would not reach the nbd_internal_ocaml_raise_error() in the above-quoted > nbd_internal_ocaml_nbd_block_status() function.There are five cases: - Callback returns > 1 => The program shouldn't do this, but it's treated the same as the next case. - Callback returns 1 => For completion callbacks this indicates that the callback was successful and you want to autoretire the command. For other types of callbacks the program shouldn't do this but it is the same as the next case. - Callback returns 0 => Indicates that the callback was successful. - Callback throws an exception => nbd_internal_ocaml_exception_in_wrapper is called => Callback returns -1 indicating (to libnbd) that there was an error. cmd->error is updated with !err || EPROTO. - Callback returns < 0 => assert fails! The assert is a bit harsh, but it indicates a bug in the OCaml code. It probably means the OCaml code thought that returning -1 indicates an error, whereas the OCaml code should throw an exception instead. For the history of why that assert was added see: https://gitlab.com/nbdkit/libnbd/-/commit/d881d160e1cd9c9964782300a7652ffb4e506c27 Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top