I noticed while reading the code that we have a documentation hole, related to completion callbacks, that may cause memory leaks for clients that are unaware of it.

The situation arises as follows. For all commands with a completion callback, I checked that the code has clean semantics: either nbd_aio_FOO() returns -1 and we never call the callback cleanup, or nbd_aio_FOO() returns a cookie and we call the callback cleanup when the command is retired. And since completion callbacks can only ever be passed to nbd_aio_FOO() functions, it is simple enough to document that when nbd_aio_FOO() fails (possible when calling it while the state machine is in the wrong state, with invalid flags, or with a command that the server is unwilling to accept - all things we can check client-side without traffic to the server), the callback function was NOT registered, and the user must clean up any resources then and there to avoid a leak. Only when nbd_aio_FOO() succeeds is the responsibility for cleanup handed to the callback.

But for nbd_block_status and nbd_pread_structured, we have a second callback, and _these_ callbacks have a problem: we can return -1 for two different reasons, either because the command was never attempted (nbd_aio_FOO failed) and so the callback's cleanup will never be called, or because the command was sent to the server and the server failed it (the cleanup WILL be called). As this is ambiguous, it risks becoming a memory leak. So I think we have a bug in our code, and need to strengthen our promise about when the cleanup callback will be called, regardless of success or failure, while still minimizing any incompatible changes that might cause a double free for existing clients.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3226
Virtualization: qemu.org | libvirt.org
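For concreteness, here is a minimal C sketch of the asymmetric contract described above, assuming the libnbd 1.x closure structs (nbd_completion_callback with .callback/.user_data/.free members); my_completion, my_free, my_data and start_read are made-up names, not anything in the library:

  #include <stdlib.h>
  #include <stdint.h>
  #include <libnbd.h>

  /* Hypothetical completion callback; returning 1 asks libnbd to
   * auto-retire the command once it completes. */
  static int
  my_completion (void *user_data, int *error)
  {
    return 1;
  }

  /* Closure cleanup: releases the per-command state. */
  static void
  my_free (void *user_data)
  {
    free (user_data);
  }

  static int64_t
  start_read (struct nbd_handle *h, void *buf, size_t len, uint64_t offset)
  {
    void *my_data = malloc (16);        /* hypothetical per-command state */
    if (my_data == NULL)
      return -1;

    int64_t cookie =
      nbd_aio_pread (h, buf, len, offset,
                     (nbd_completion_callback) { .callback = my_completion,
                                                 .user_data = my_data,
                                                 .free = my_free },
                     0);
    if (cookie == -1)
      free (my_data);   /* documented rule: the closure was never
                         * registered, so clean up here and now */
    /* On success, libnbd calls my_free() when the command is retired. */
    return cookie;
  }

The same caller-side cleanup cannot be applied safely to the extent callback of nbd_block_status() or the chunk callback of nbd_pread_structured(), because a -1 return from those functions does not say whether the closure's .free has already run - which is the ambiguity described above.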
On 9/5/20 7:47 AM, Eric Blake wrote:
> I noticed while reading the code that we have a documentation hole, related to completion callbacks, that may cause memory leaks for clients that are unaware of it.
>
> The situation arises as follows. For all commands with a completion callback, I checked that the code has clean semantics: either nbd_aio_FOO() returns -1 and we never call the callback cleanup, or nbd_aio_FOO() returns a cookie and we call the callback cleanup when the command is retired. And since completion callbacks can only ever be passed to nbd_aio_FOO() functions, it is simple enough to document that when nbd_aio_FOO() fails (possible when calling it while the state machine is in the wrong state, with invalid flags, or with a command that the server is unwilling to accept - all things we can check client-side without traffic to the server), the callback function was NOT registered, and the user must clean up any resources then and there to avoid a leak. Only when nbd_aio_FOO() succeeds is the responsibility for cleanup handed to the callback.
>
> But for nbd_block_status and nbd_pread_structured, we have a second callback, and _these_ callbacks have a problem: we can return -1 for two different reasons, either because the command was never attempted (nbd_aio_FOO failed) and so the callback's cleanup will never be called, or because the command was sent to the server and the server failed it (the cleanup WILL be called). As this is ambiguous, it risks becoming a memory leak. So I think we have a bug in our code, and need to strengthen our promise about when the cleanup callback will be called, regardless of success or failure, while still minimizing any incompatible changes that might cause a double free for existing clients.

Thinking about it more, it is probably easiest to just declare that we guarantee the cleanup callback is called regardless of failure mode, for all closures, and that we merely had a memory leak in earlier libnbd releases. Yes, this means we risk older apps hitting a double free if they cleaned up after a cookie of -1 to compensate for our failure to clean up; but that is unlikely, since the code for an app to call the cleanup manually is more verbose, and since well-behaved apps are unlikely to call functions that will fail client-side and trigger the problem in the first place.

I'll post a patch, including coverage in tests/closure-lifetimes.c, to demonstrate what I mean.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3226
Virtualization: qemu.org | libvirt.org
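As a hypothetical sketch (not the actual tests/closure-lifetimes.c) of what that stronger guarantee would let a test assert, assuming the libnbd 1.x closure structs and made-up names extent_cb and count_free: once a closure has been handed to an nbd_aio_FOO() call, its .free function runs exactly once, even when the call fails client-side before any traffic reaches the server.

  #include <assert.h>
  #include <stdio.h>
  #include <stdint.h>
  #include <libnbd.h>

  static unsigned freed;                /* counts how often .free ran */

  static int
  extent_cb (void *user_data, const char *metacontext, uint64_t offset,
             uint32_t *entries, size_t nr_entries, int *error)
  {
    return 0;
  }

  static void
  count_free (void *user_data)
  {
    freed++;
  }

  int
  main (void)
  {
    struct nbd_handle *h = nbd_create ();
    assert (h != NULL);

    /* The handle was never connected, so this fails client-side
     * and returns -1 without sending anything to a server. */
    int64_t cookie =
      nbd_aio_block_status (h, 512, 0,
                            (nbd_extent_callback) { .callback = extent_cb,
                                                    .free = count_free },
                            (nbd_completion_callback) { .callback = NULL },
                            0);
    assert (cookie == -1);
    assert (freed == 1);        /* the promised cleanup still happened */

    nbd_close (h);
    printf ("closure freed exactly once\n");
    return 0;
  }

Under the older behavior the second assert would fail and the closure would leak; with the guarantee above, callers never need to free a closure themselves once it has been passed in, regardless of the return value.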
On 9/5/20 8:56 AM, Eric Blake wrote:
> Thinking about it more, it is probably easiest to just declare that we guarantee the cleanup callback is called regardless of failure mode, for all closures, and that we merely had a memory leak in earlier libnbd releases. Yes, this means we risk older apps hitting a double free if they cleaned up after a cookie of -1 to compensate for our failure to clean up; but that is unlikely, since the code for an app to call the cleanup manually is more verbose, and since well-behaved apps are unlikely to call functions that will fail client-side and trigger the problem in the first place.
>
> I'll post a patch, including coverage in tests/closure-lifetimes.c, to demonstrate what I mean.

Proof that common clients have a memory leak (rather than merely a risk of a double-free error), and thus that this is worth fixing by guaranteeing closure cleanup even on client-side failures:

$ export count=10
$ nbdkit null 1m --run 'export uri; /usr/bin/time -v nbdsh -c "
h.set_request_structured_replies(False)
def f (c, o, e, err):
    pass
import os
h.connect_uri(os.environ[\"uri\"])
for _ in range(int(os.environ[\"count\"])):
    try:
        h.block_status(512, 0, f)
    except nbd.Error as ex:
        pass
print(\"got here\")
"' 2>&1 | grep '\(here\|resident\)'
got here
print("got here")
Maximum resident set size (kbytes): 14416
Average resident set size (kbytes): 0

$ count=10000
$ nbdkit null 1m --run 'export uri; /usr/bin/time -v nbdsh -c "
h.set_request_structured_replies(False)
def f (c, o, e, err):
    pass
import os
h.connect_uri(os.environ[\"uri\"])
for _ in range(int(os.environ[\"count\"])):
    try:
        h.block_status(512, 0, f)
    except nbd.Error as ex:
        pass
print(\"got here\")
"' 2>&1 | grep '\(here\|resident\)'
got here
print("got here")
Maximum resident set size (kbytes): 17796
Average resident set size (kbytes): 0

Ideally, increasing the loop iterations should have no effect on the maximum memory usage. Replacing the h.set_request_structured_replies(False) line with h.add_meta_context(nbd.CONTEXT_BASE_ALLOCATION) (so that the block status request succeeds instead) proves the point: after that change, I'm seeing 17296k with count=1 and 17284k with count=10000 (that is, the memory usage is higher and somewhat less deterministic because there is now server interaction, but it definitely no longer scales up as count increases).

And in proving that, I found several _other_ bugs, now fixed: Python.ml was mapping Bool incorrectly (so that h.set_request_structured_replies(False) was often setting the value to true instead), which warranted testsuite coverage of functions previously uncalled by the Python or OCaml testsuites, and that in turn flushed out bugs in the OCaml NBD.set_tls and NBD.set_handshake_flags.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3226
Virtualization: qemu.org | libvirt.org