thr3ads.net - Libguestfs - [Libguestfs] libnbd: When are callbacks freed [Jul 2023]

If this information is useful, please help other people find it:
Share via:

Tage Johansson

2023-Jul-13 16:18 UTC

[Libguestfs] libnbd: When are callbacks freed

On 7/13/2023 5:37 PM, Richard W.M. Jones wrote:> On Thu, Jul 13, 2023 at 03:05:30PM +0000, Tage Johansson wrote:
>> On 7/13/2023 4:36 PM, Eric Blake wrote:
>>> On Thu, Jul 13, 2023 at 01:37:49PM +0000, Tage Johansson wrote:
>>>> On 7/13/2023 3:26 PM, Richard W.M. Jones wrote:
>>>>> On Thu, Jul 13, 2023 at 08:01:09AM -0500, Eric Blake wrote:
>>>>>> On Thu, Jul 13, 2023 at 07:13:37AM -0500, Eric Blake
wrote:
>>>>>>>>> I have replaced a call to `nbd_opt_info()`
with a call to
>>>>>>>>> `nbd_aio_opt_info()` and passed in a
completion callback which just
>>>>>>>>> calls `exit(EXIT_FAILURE)`. So if the
completion callback is called
>>>>>>>>> the test should fail, which it doesn't,
at least not on my machine.
>>>>>>>> Isn't that OK?  Only .free is required to
be called.
>>>>>>> For the context callback (for opt_set_meta),
.callback may be called
>>>>>>> zero, one, or multiple times, but .free should be
called exactly once.
>>>>>>> But for the completion callback, I agree that the
docs state that both
>>>>>>> .callback and .free should each be called exactly
once ("On the other
>>>>>>> hand, if a completion callback is supplied (only
possible with
>>>>>>> asynchronous commands), it will always be reached
exactly once, and
>>>>>>> the completion callback must not ignore the value
pointed to by
>>>>>>> C<error>."); we are indeed missing the
call to .callback.  I'll work
>>>>>>> on a patch.
>>>>>> Eww, the bug is bigger than just nbd_aio_opt* not
always calling
>>>>>> completion.callback exactly once.  With just this diff
(to be folded
>>>>>> into the larger patch I'm working on), I'm
getting an assertion
>>>>>> failure that we fail to call completion.callback for
>>>>>> nbd_aio_pread_structured when calling nbd_close() prior
to the command
>>>>>> running to completion, so I'll have to fix that
too.
>>>>> Just to be clear about this, are we really sure the
completion
>>>>> callback should always be called once?  I'm not clear
why that should
>>>>> be the case.  (The .free callback however should be.)
>>>>>
>>>>> For example, if I call a function with bogus invalid
parameters so
>>>>> that it fails very early on (or when the handle is in an
incorrect
>>>>> state), should I expect the completion callback to be
called?  I would
>>>>> expect not.
>>>>>
>>>>> Rich.
>>>> The user needs a way to know if an error occurred. So the
completion
>>>> callback must be called if the asynchronous function did not
fail (returned
>>>> 0). If the completion callback should be called, with the error
parameter
>>>> set, even when the asynchronous function immediately failed
with a non-zero
>>>> return value is another question. I see two possibilities:
Either the
>>>> completion callback should always be called. Or it should be
called iff the
>>>> asynchronous function returned 0 (did not fail).
>>> Good point.
>>>
>>> Our documentation currently states (docs/libnbd.pod) that the
>>> completion callback is ALWAYS called, but this is inconsistent with
>>> current code - you are correct that at present, the completion
>>> callback is NOT called if the aio command returns non-zero (easy
test:
>>> call nbd_aio_block_status before nbd_connect*).  I'm game to
updating
>>> that part of the documentation to match existing practice (changing
>>> our guarantee to the completion callback will be called once iff
the
>>> aio command reported success), since our testsuite wasn't
covering it,
>>> and it is probably an easier fix than munging the generator to call
>>> completion.callback even for aio failures.  Meanwhile, the .free
>>> callback MUST be called unconditionally, and I think our testsuite
is
>>> already covering that.
>>>
>>> Life-cycle wise, I see the following sequence as being something we
>>> could usefully rely on (although we don't yet have enough
testsuite
>>> coverage to prove that we already have it).  Note that it is not
only
>>> important how many times things are called, but I would like it if
we
>>> can guarantee the specific ordering between them (neither .free
should
>>> be called until all .callback opportunities are complete finished,
to
>>> avoid any use-after-free issues regardless of which of the two
.free
>>> slots the programmer actually uses):
>>>
>>> - if aio command fails:
>>> mid-command .callback: 0 calls
>>> completion .callback: 0 calls
>>> mid-command .free: 1 call
>>> completion .free: 1 call
>>>
>>> - if aio command succeeds:
>>> mid-command .callback: 0, 1, or multiple times
>>> completion .callback: 1 call
>>> mid-command .free: 1 call
>>> completion .free: 1 call
>>>
>>> What I'm not sure of yet (without more code inspection) is
whether we
>>> can sometimes call completion.callback after mid-command.free.
>>>
>>>> By the way, if the error parameter is set in the completion
callback, how
>>>> can the user retrieve the error text? Is it possible to call
>>>> `get_nbd_error(3)` from inside the completion callback?
>>> You mean nbd_get_error().  We currently document that it is not
safe
>>> to call any nbd_* from within a callback.  Maybe we could relax
that
>>> to carve out exceptions for nbd_get_error/errno as special cases,
>>> since they only inspect thread-local state and can be used even
when
>>> there is no current nbd_handle.  The problem is that I'm not
sure if
>>> there is a reliable string at that point in time: part of the
reason
>>> the completion callback has an errnum parameter is that the point
of
>>> reporting the completion failure may be in a different thread than
>>> when the failure was first detected, and we only preserved a
numeric
>>> value in cmd->error rather than also preserving a string.
>>>
>>> Put another way, there is no way to guarantee that nbd_get_errno()
>>> will return the same value as the errnum parameter to the callback;
>>> and if that is true, then even if calling nbd_get_error()
doesn't
>>> crash or deadlock from within a callback, it is likewise probable
that
>>> any such string returned is not consistent with the errnum
parameter
>>> passed to the callback.
>>
>> So is there any safe way to get some description of the error from a
>> completion callback apart from a non-zero number? It isn't too
>> helpful to report to the user that the read operation faild with -1.
> As I recall, from the callback, no.
>
> The error is not lost however, it will be returned by the API call
> itself.  eg. If you're in nbd_aio_opt_list -> callback (error) then
> nbd_aio_opt_list will return -1 and at that point you can use
> nbd_get_error to report the error.

I don't understand. If I call `nbd_aio_opt_list()` with a completion 
callback. `nbd_aio_opt_list()` will return immediately, maybe with a 
successful result. But the command is not complete until 
`nbd_aio_is_connecting()` returns `false`, so the completion callback 
may be invoked with an error after `nbd_aio_opt_list()` has returned.


Also, does the value of `err` (passed into a caller) has any meaning 
like a known errno value or something else that can be converted to a 
description? Or is it just an arbitrary non zero integer?

> Rich.
>

Richard W.M. Jones

2023-Jul-13 16:46 UTC

head link

[Libguestfs] libnbd: When are callbacks freed

On Thu, Jul 13, 2023 at 04:18:03PM +0000, Tage Johansson
wrote:> 
> On 7/13/2023 5:37 PM, Richard W.M. Jones wrote:
> >On Thu, Jul 13, 2023 at 03:05:30PM +0000, Tage Johansson wrote:
> >>On 7/13/2023 4:36 PM, Eric Blake wrote:
> >>>On Thu, Jul 13, 2023 at 01:37:49PM +0000, Tage Johansson wrote:
> >>>>On 7/13/2023 3:26 PM, Richard W.M. Jones wrote:
> >>>>>On Thu, Jul 13, 2023 at 08:01:09AM -0500, Eric Blake
wrote:
> >>>>>>On Thu, Jul 13, 2023 at 07:13:37AM -0500, Eric
Blake wrote:
> >>>>>>>>>I have replaced a call to
`nbd_opt_info()` with a call to
> >>>>>>>>>`nbd_aio_opt_info()` and passed in a
completion callback which just
> >>>>>>>>>calls `exit(EXIT_FAILURE)`. So if the
completion callback is called
> >>>>>>>>>the test should fail, which it
doesn't, at least not on my machine.
> >>>>>>>>Isn't that OK?  Only .free is required
to be called.
> >>>>>>>For the context callback (for opt_set_meta),
.callback may be called
> >>>>>>>zero, one, or multiple times, but .free should
be called exactly once.
> >>>>>>>But for the completion callback, I agree that
the docs state that both
> >>>>>>>.callback and .free should each be called
exactly once ("On the other
> >>>>>>>hand, if a completion callback is supplied
(only possible with
> >>>>>>>asynchronous commands), it will always be
reached exactly once, and
> >>>>>>>the completion callback must not ignore the
value pointed to by
> >>>>>>>C<error>."); we are indeed missing
the call to .callback.  I'll work
> >>>>>>>on a patch.
> >>>>>>Eww, the bug is bigger than just nbd_aio_opt* not
always calling
> >>>>>>completion.callback exactly once.  With just this
diff (to be folded
> >>>>>>into the larger patch I'm working on), I'm
getting an assertion
> >>>>>>failure that we fail to call completion.callback
for
> >>>>>>nbd_aio_pread_structured when calling nbd_close()
prior to the command
> >>>>>>running to completion, so I'll have to fix that
too.
> >>>>>Just to be clear about this, are we really sure the
completion
> >>>>>callback should always be called once?  I'm not
clear why that should
> >>>>>be the case.  (The .free callback however should be.)
> >>>>>
> >>>>>For example, if I call a function with bogus invalid
parameters so
> >>>>>that it fails very early on (or when the handle is in
an incorrect
> >>>>>state), should I expect the completion callback to be
called?  I would
> >>>>>expect not.
> >>>>>
> >>>>>Rich.
> >>>>The user needs a way to know if an error occurred. So the
completion
> >>>>callback must be called if the asynchronous function did
not fail (returned
> >>>>0). If the completion callback should be called, with the
error parameter
> >>>>set, even when the asynchronous function immediately failed
with a non-zero
> >>>>return value is another question. I see two possibilities:
Either the
> >>>>completion callback should always be called. Or it should
be called iff the
> >>>>asynchronous function returned 0 (did not fail).
> >>>Good point.
> >>>
> >>>Our documentation currently states (docs/libnbd.pod) that the
> >>>completion callback is ALWAYS called, but this is inconsistent
with
> >>>current code - you are correct that at present, the completion
> >>>callback is NOT called if the aio command returns non-zero
(easy test:
> >>>call nbd_aio_block_status before nbd_connect*).  I'm game
to updating
> >>>that part of the documentation to match existing practice
(changing
> >>>our guarantee to the completion callback will be called once
iff the
> >>>aio command reported success), since our testsuite wasn't
covering it,
> >>>and it is probably an easier fix than munging the generator to
call
> >>>completion.callback even for aio failures.  Meanwhile, the
.free
> >>>callback MUST be called unconditionally, and I think our
testsuite is
> >>>already covering that.
> >>>
> >>>Life-cycle wise, I see the following sequence as being
something we
> >>>could usefully rely on (although we don't yet have enough
testsuite
> >>>coverage to prove that we already have it).  Note that it is
not only
> >>>important how many times things are called, but I would like it
if we
> >>>can guarantee the specific ordering between them (neither .free
should
> >>>be called until all .callback opportunities are complete
finished, to
> >>>avoid any use-after-free issues regardless of which of the two
.free
> >>>slots the programmer actually uses):
> >>>
> >>>- if aio command fails:
> >>>mid-command .callback: 0 calls
> >>>completion .callback: 0 calls
> >>>mid-command .free: 1 call
> >>>completion .free: 1 call
> >>>
> >>>- if aio command succeeds:
> >>>mid-command .callback: 0, 1, or multiple times
> >>>completion .callback: 1 call
> >>>mid-command .free: 1 call
> >>>completion .free: 1 call
> >>>
> >>>What I'm not sure of yet (without more code inspection) is
whether we
> >>>can sometimes call completion.callback after mid-command.free.
> >>>
> >>>>By the way, if the error parameter is set in the completion
callback, how
> >>>>can the user retrieve the error text? Is it possible to
call
> >>>>`get_nbd_error(3)` from inside the completion callback?
> >>>You mean nbd_get_error().  We currently document that it is not
safe
> >>>to call any nbd_* from within a callback.  Maybe we could relax
that
> >>>to carve out exceptions for nbd_get_error/errno as special
cases,
> >>>since they only inspect thread-local state and can be used even
when
> >>>there is no current nbd_handle.  The problem is that I'm
not sure if
> >>>there is a reliable string at that point in time: part of the
reason
> >>>the completion callback has an errnum parameter is that the
point of
> >>>reporting the completion failure may be in a different thread
than
> >>>when the failure was first detected, and we only preserved a
numeric
> >>>value in cmd->error rather than also preserving a string.
> >>>
> >>>Put another way, there is no way to guarantee that
nbd_get_errno()
> >>>will return the same value as the errnum parameter to the
callback;
> >>>and if that is true, then even if calling nbd_get_error()
doesn't
> >>>crash or deadlock from within a callback, it is likewise
probable that
> >>>any such string returned is not consistent with the errnum
parameter
> >>>passed to the callback.
> >>
> >>So is there any safe way to get some description of the error from
a
> >>completion callback apart from a non-zero number? It isn't too
> >>helpful to report to the user that the read operation faild with
-1.
> >As I recall, from the callback, no.
> >
> >The error is not lost however, it will be returned by the API call
> >itself.  eg. If you're in nbd_aio_opt_list -> callback (error)
then
> >nbd_aio_opt_list will return -1 and at that point you can use
> >nbd_get_error to report the error.
> 
> 
> I don't understand. If I call `nbd_aio_opt_list()` with a completion
> callback. `nbd_aio_opt_list()` will return immediately, maybe with a
> successful result. But the command is not complete until
> `nbd_aio_is_connecting()` returns `false`, so the completion
> callback may be invoked with an error after `nbd_aio_opt_list()` has
> returned.
It's returned by some later call, such as nbd_aio_notify_read.

Basically libnbd doesn't create threads, so the only place that
callbacks can be called is when you call into libnbd.  That function
that you call is what will return an error.
> Also, does the value of `err` (passed into a caller) has any meaning
> like a known errno value or something else that can be converted to
> a description? Or is it just an arbitrary non zero integer?
It's a C errno number.  Setting it will change what nbd_get_errno
returns.

In the OCaml bindings we actually provide a function to translate
OCaml <-> C errnos for this purpose:

  val errno_of_unix_error : Unix.error -> int
  (** Return the raw C errno corresponding to a {!Unix.error}.  This
      can be used in callbacks to update the [int ref] parameter. *)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit

Eric Blake

2023-Jul-13 18:37 UTC

head link

[Libguestfs] libnbd: When are callbacks freed

On Thu, Jul 13, 2023 at 04:18:03PM +0000, Tage Johansson
wrote:> 
> > > So is there any safe way to get some description of the error
from a
> > > completion callback apart from a non-zero number? It isn't
too
> > > helpful to report to the user that the read operation faild with
-1.
> > As I recall, from the callback, no.
> > 
> > The error is not lost however, it will be returned by the API call
> > itself.  eg. If you're in nbd_aio_opt_list -> callback (error)
then
> > nbd_aio_opt_list will return -1 and at that point you can use
> > nbd_get_error to report the error.
> 
> 
> I don't understand. If I call `nbd_aio_opt_list()` with a completion
> callback. `nbd_aio_opt_list()` will return immediately, maybe with a
> successful result. But the command is not complete until
> `nbd_aio_is_connecting()` returns `false`, so the completion callback may
be
> invoked with an error after `nbd_aio_opt_list()` has returned.
When you call nbd_aio_opt_list(), here are several possible scenarios
that can[*] happen (not exhaustive, but should be a good starting
point for thinking about it):

1. pre-transmission early failure
- call nbd_aio_opt_list
  - early failure detected, nothing queued to send to server
  - call list_callback.free if provided
  - call completion_callback.free if provided
  - nbd_aio_opt_list returns -1
- nbd_aio_is_negotiating is unchanged (the early failure did not
  change the state, .callbacks were not invoked, and .free is done
  before aio)

2. command succeeds after blocking
- nbd_aio_is_negotiating is true (otherwise we were in scenario 1)
- call nbd_aio_opt_list
  - no early failure detected, command is queued via nbd_internal_run()
    - state machine changes to connecting instead of negotiating
    - advance through states to send the request, hit EAGAIN waiting to
      read the server's reply
    - nbd_internal_run returns 0
  - nbd_aio_opt_list returns 0
- nbd_aio_is_negotiating is false, you must manually move the state
  machine along once you have data (by nbd_aio_poll,
  nbd_aio_notify_read, ...)
  - state machine reaches server responses
  - list_callback.callback called if server's reply includes a name
  - state machine gets to end of server's message
  - call completion_callback.callback if provided, with error 0
  - state set back to negotiating
  - call list_callback.free (if provided)
  - call completion_callback.free (if provided)
  - nbd_aio_poll/nbd_aio_notify_write/... returns 0
- nbd_aio_is_negotiating is true again (callbacks reached after the aio
  call completed)

3. server failure not detected until after blocking
- nbd_aio_is_negotiating is true (otherwise we were in scenario 1)
- call nbd_aio_opt_list
  - no early failure detected, command is queued via nbd_internal_run()
    - state machine changes to connecting instead of negotiating
    - advance through states to send the request, hit EAGAIN waiting to
      read the server's reply
    - nbd_internal_run returns 0
  - nbd_aio_opt_list returns 0
- nbd_aio_is_negotiating is false, you must manually move the state
  machine along once you have data (by nbd_aio_poll,
  nbd_aio_notify_read, ...)
  - state machine reaches server responses
  - list_callback.callback called if server's reply includes a name
  - state machine gets to end of server's message
  - call completion_callback.callback if provided, with error reported
    by the server
  - state set back to negotiating
  - call list_callback.free (if provided)
  - call completion_callback.free (if provided)
  - nbd_aio_poll/nbd_aio_notify_write/... returns 0
- nbd_aio_is_negotiating is true again (you had to use a completion
  callback to know about the server failure)

4. transmission failure after blocking
- nbd_aio_is_negotiating is true (otherwise we were in scenario 1)
- call nbd_aio_opt_list
  - no early failure detected, command is queued via nbd_internal_run()
    - state machine changes to connecting instead of negotiating
    - advance through states to send the request, hit EAGAIN waiting to
      read the server's reply
    - nbd_internal_run returns 0
  - nbd_aio_opt_list returns 0
- nbd_aio_is_negotiating is false, you must manually move the state
  machine along once you have data (by nbd_aio_poll,
  nbd_aio_notify_read, ...)
  - state machine sees server is no longer connected
  - state set to DEAD
  - call completion_callback.callback if provided, with error ENOTCONN
  - call list_callback.free (if provided)
  - call completion_callback.free (if provided)
  - nbd_aio_poll/nbd_aio_notify_write/... returns -1
- nbd_aio_is_negotiating is false (the completion callback occurs
  after the aio success)

5. server failure detected without blocking (unlikely, but possible
under gdb or heavy system load)
- nbd_aio_is_negotiating is true (otherwise we were in scenario 1)
- call nbd_aio_opt_list
  - no early failure detected, command is queued via nbd_internal_run()
    - state machine changes to connecting instead of negotiating
    - advance through states to send the request and read the server's
      reply
    - list_callback.callback might be called if server's reply includes
      a name
    - state machine gets to end of server's message
    - call completion_callback.callback if provided, with error set to
      the server's error
    - state set back to negotiating
    - call list_callback.free (if provided)
    - call completion_callback.free (if provided)
    - nbd_internal_run returns 0
  - nbd_aio_opt_list returns 0
- nbd_aio_is_negotiating is once again true (.free ran before the aio
  call completed, and you can directly issue another nbd_aio_opt
  command)

6. list callback requests command failure despite server success
- nbd_aio_is_negotiating is true (otherwise we were in scenario 1)
- call nbd_aio_opt_list
  - no early failure detected, command is queued via nbd_internal_run()
    - state machine changes to connecting instead of negotiating
    - advance through states to send the request, hit EAGAIN waiting to
      read the server's reply
    - nbd_internal_run returns 0
  - nbd_aio_opt_list returns 0
- nbd_aio_is_negotiating is false, you must manually move the state
  machine along once you have data (by nbd_aio_poll,
  nbd_aio_notify_read, ...)
  - state machine reaches server responses
  - list_callback.callback called if server's reply includes a name,
    and callback sets err and returns -1
  - state machine gets to end of server's success message
  - call completion_callback.callback if provided, with error from the
    earlier callback
  - state set back to negotiating
  - call list_callback.free (if provided)
  - call completion_callback.free (if provided)
  - nbd_aio_poll/nbd_aio_notify_write/... returns 0
- nbd_aio_is_negotiating is true again

[*] possibly modulo patches I'm about to post, if that isn't already
the case...
> 
> 
> Also, does the value of `err` (passed into a caller) has any meaning like a
> known errno value or something else that can be converted to a description?
> Or is it just an arbitrary non zero integer?
It is the POSIX errno value, either translated from the server's
response (e.g., if the server fails NBD_CMD_WRITE with NBD_ENOSPC,
you'll get your local system's value of ENOSPC even if it differs from
the NBD protocol's value), or injected directly by libnbd (e.g., we
have several spots where we inject EPROTO for cases where libnbd has
detected that the server is not compliant to the NBD protocol, even
though there is no NBD_EPROTO error in the protocol).  If 'err' is 0,
the overall command succeeded (you know the server did the command).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Libguestfs - Jul 2023 - libnbd: When are callbacks freed

[Libguestfs] libnbd: When are callbacks freed

[Libguestfs] libnbd: When are callbacks freed

[Libguestfs] libnbd: When are callbacks freed