On Wed, Jul 12, 2023 at 03:19:59PM +0000, Tage Johansson wrote:> Hello, > > > While writing some tests for the Rust bindings, I discovered a memory leak > with Valgrind due to a completion callback not being freed. More specificly, > the completion callback of `aio_opt_info()` doesn't seem to be if > `aio_opt_info()` failes. In this case, `aio_opt_info()` was called in the > connected state, so it should indeed fail, but it should perhaps also call > the free function associated with the callback.Can you paste the valgrind output showing the leak? It is important to view the generated lib/api.c. Note that nbd_aio_opt_info() unconditionally calls any remaining free callback prior to returning -1. If you called nbd_aio_opt_info() from the wrong state, we never even get to the nbd_unlocked_aio_opt_info() code in lib/opt.c. So the only thing we really have to worry about is if we have any exit paths where we have claimed ownership of the callback by transferring it elsewhere (that is, where we called SET_CALLBACK_TO_NULL because we have set up h->opt_cb) but still return -1 without our claimed ownership still having a reachable path for still calling the callback at the right time.> > > According to libnbd(3) <https://libguestfs.org/libnbd.3.html>: > > > Callback lifetimes > > You can associate an optional free function with callbacks. Libnbd will > > call this function when the callback will not be called again by libnbd, > > including in the case where the API fails. > > After studying the code in lib/opt.c, I think that the situation with > synchronous callbacks like `nbd_opt_list()` is even worse. Here the list > callback will not be freed if the internal call to > `nbd_unlocked_aio_opt_list()` failes, but it will be freed if the completion > callback got an error which is returned by `nbd_opt_list()`. > > > > /* Issue NBD_OPT_LIST and wait for the reply. */ > > int > > nbd_unlocked_opt_list (struct nbd_handle *h, nbd_list_callback *list) > > { > > ? struct list_helper s = { .list = *list }; > > ? nbd_list_callback l = { .callback = list_visitor, .user_data = &s }; > > ? nbd_completion_callback c = { .callback = list_complete, .user_data > > &s }; > > > > ? if (nbd_unlocked_aio_opt_list (h, &l, &c) == -1) > Here the list callback has not been freed. > > ??? return -1;True at this point, but it also has not been NULL'd. But nbd_unlocked_aio_opt_list() can ONLY return -1 if fails prior to the SET_CALLBACK_TO_NULL lines, so we also know *list still contains the free callback, and the wrapper nbd_opt_list() in lib/api.c WILL call the callback one function up in the call stack, so there is no leak here. [meta-comment: I find that when repling inline, which is common practice in patch reviews, it helps if I include a blank line before and after anything I inject, so my reply text doesn't get overlooked when scanning the wall of quoted text]> > > > ? SET_CALLBACK_TO_NULL (*list); > > ? if (wait_for_option (h) == -1) > > ??? return -1;But here, we've explicitly taken ownership of *list (that is, successful nbd_unlocked_aio_opt_list() transferred the callback over to h->opt_cb.fn.list), so if wait_for_option() does not free the callback, then we must do so or we indeed have a bug. But to actually figure that out, I have to inspect the state machine, because wait_for_option() itself does not currently do anything with the callbacks, on success or on failure. /me goes and digs further... I _do_ see that nbd_internal_free_option() calls the appropriate callbacks. So the question is if we can guarantee that h->opt_cb was set, do we always reach nbd_internal_free_option() prior to nbd_unlocked_poll() progressing the state machine to the point that we either succeeded or transitioned out of state_connecting. When I look over the older lib/rw.c, I remember spending quite a bit of time patching it to get the handling correct. Using nbd_unlocked_pread_structured as the comparison: if nbd_unlocked_aio_pread_structured() returns -1, then either *chunk has not yet been set to NULL (so the api.c nbd_pread_structured will call the free callback), or the code ensured that the transfer semantics happened (and now the state machine will take care of calling the free callback at the right point in time on cmd->cb). There's even code in nbd_internal_common_command() that frees the callback before returning -1 when the command could not be queued to the state machine. I even remember how much time I spent writing tests/test-errors.c (which Rich then later subdivided into finer-grained tests) covering various aspects (failure when called in the wrong state, failure with parameter validation before contacting the server, failure returned by the server, ...) and that each of them didn't leak. But I would not be surprised if my more recent additions of lib/opt.c missed some steps (unlike commands, you don't have parallel options in flight, so the code is simpler - but maybe I simplified too much), where we could have a leak. At any rate, I could argue that in the synchronous nbd_unlocked_opt functions, we should probably be able to assert that the callback was already NULL'd by the successful nbd_unlocked_aio_opt counterpart on success, rather than manually trying to re-NULL it ourselves.> > ? if (s.err) { > But here I think that the callback has been freed. > > ??? set_error (s.err, "server replied with error to list request"); > > ??? return -1; > > ? } > > ? return s.count; > > } > > The problem is that if `nbd_opt_list()` returns an error, I as a user cannot > know wether I should free the data associated with the list callback myself > or if that has been done already.The synchronous calls should ALWAYS call the free callback by the time they complete, whether on success or on error. If they don't, that's a bug. Looks like I should expand tests/opt-list.c to pass in a free callback, to ensure that it is reached the correct number of times.> > > Maybe I am just misunderstanding the code or the API, but if not then I > wonder how I should know when I need to free the user data associated with a > callback myself and when that is done by libnbd?It should always be done by libnbd. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
On Wed, Jul 12, 2023 at 03:33:28PM -0500, Eric Blake wrote:> > > /* Issue NBD_OPT_LIST and wait for the reply. */ > > > int > > > nbd_unlocked_opt_list (struct nbd_handle *h, nbd_list_callback *list) > > > { > > > ? struct list_helper s = { .list = *list }; > > > ? nbd_list_callback l = { .callback = list_visitor, .user_data = &s }; > > > ? nbd_completion_callback c = { .callback = list_complete, .user_data > > > &s }; > > > > > > ? if (nbd_unlocked_aio_opt_list (h, &l, &c) == -1)We are passing a stack-local variable that points to the user's callback...[1]> > Here the list callback has not been freed. > > > ??? return -1; > > True at this point, but it also has not been NULL'd. But > nbd_unlocked_aio_opt_list() can ONLY return -1 if fails prior to the > SET_CALLBACK_TO_NULL lines, so we also know *list still contains the > free callback, and the wrapper nbd_opt_list() in lib/api.c WILL call > the callback one function up in the call stack, so there is no leak > here. > > > > > > > ? SET_CALLBACK_TO_NULL (*list); > > > ? if (wait_for_option (h) == -1) > > > ??? return -1; > > But here, we've explicitly taken ownership of *list (that is, > successful nbd_unlocked_aio_opt_list() transferred the callback over > to h->opt_cb.fn.list), so if wait_for_option() does not free the > callback, then we must do so or we indeed have a bug. But to actually > figure that out, I have to inspect the state machine, because > wait_for_option() itself does not currently do anything with the > callbacks, on success or on failure. > > /me goes and digs further... > > I _do_ see that nbd_internal_free_option() calls the appropriate > callbacks. So the question is if we can guarantee that h->opt_cb was > set, do we always reach nbd_internal_free_option() prior to > nbd_unlocked_poll() progressing the state machine to the point that we > either succeeded or transitioned out of state_connecting. > >I meant to add that a grep for nbd_intenral_free_option() shows that all of the generator/states-newstyle-opt-*.c do indeed call this as part of their state machine. So once h->opt_cb is set (all of our synchronous nbd_opt calls do so), the state machine guarantees that we will then reach go_complete(), list_complete(), or context_complete() (whichever function we registered in our stack-local completion callback), which in turn calls the user's free callback before returning on the synchronous functions.> > When I look over the older lib/rw.c, I remember spending quite a bit > of time patching it to get the handling correct. Using > nbd_unlocked_pread_structured as the comparison: if > nbd_unlocked_aio_pread_structured() returns -1, then either *chunk has > not yet been set to NULL (so the api.c nbd_pread_structured will call > the free callback), or the code ensured that the transfer semantics > happened (and now the state machine will take care of calling the free > callback at the right point in time on cmd->cb). There's even code in > nbd_internal_common_command() that frees the callback before returning > -1 when the command could not be queued to the state machine. I even > remember how much time I spent writing tests/test-errors.c (which Rich > then later subdivided into finer-grained tests) covering various > aspects (failure when called in the wrong state, failure with > parameter validation before contacting the server, failure returned by > the server, ...) and that each of them didn't leak. > > But I would not be surprised if my more recent additions of lib/opt.c > missed some steps (unlike commands, you don't have parallel options in > flight, so the code is simpler - but maybe I simplified too much), > where we could have a leak.With my deeper look and testsuite additions, I haven't been able to prove a leak on either success or failure paths. Again, if you have a valgrind trace, please share it (the leak may be in your use of the code as part of the Rust bindings, and not in the core libnbd code).> > At any rate, I could argue that in the synchronous nbd_unlocked_opt > functions, we should probably be able to assert that the callback was > already NULL'd by the successful nbd_unlocked_aio_opt counterpart on > success, rather than manually trying to re-NULL it ourselves.[1] the aio function NULL'd our stack-local variable, not the user's callback, but we did successfully transfer it to the state machine, so this explicit NULL (rather than asserting that it was already nulled) is correct after all. Still, I'm not opposed to a patch adding comments to lib/opt.c to make things clearer. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
On 7/12/2023 10:33 PM, Eric Blake wrote:> On Wed, Jul 12, 2023 at 03:19:59PM +0000, Tage Johansson wrote: >> Hello, >> >> >> While writing some tests for the Rust bindings, I discovered a memory leak >> with Valgrind due to a completion callback not being freed. More specificly, >> the completion callback of `aio_opt_info()` doesn't seem to be if >> `aio_opt_info()` failes. In this case, `aio_opt_info()` was called in the >> connected state, so it should indeed fail, but it should perhaps also call >> the free function associated with the callback. > Can you paste the valgrind output showing the leak?The Valgrind output is very Rust specific and only shows the Rust allocations which goes into the completion callback and are lost. But if you apply the following patch (which modifies tests/opt-info.c) you shall see that the completion callback is not called. I have replaced a call to `nbd_opt_info()` with a call to `nbd_aio_opt_info()` and passed in a completion callback which just calls `exit(EXIT_FAILURE)`. So if the completion callback is called the test should fail, which it doesn't, at least not on my machine. diff --git a/tests/opt-info.c b/tests/opt-info.c index 86a0608..ba5c72b 100644 --- a/tests/opt-info.c +++ b/tests/opt-info.c @@ -30,6 +30,12 @@ ?#include <libnbd.h> +int +my_completion_cb(void *_user_data, int *_err) +{ +? exit(EXIT_FAILURE); +} + ?int ?main (int argc, char *argv[]) ?{ @@ -201,7 +207,10 @@ main (int argc, char *argv[]) ???? fprintf (stderr, "expecting export name 'good', got '%s'\n", s); ???? exit (EXIT_FAILURE); ?? } -? if (nbd_opt_info (nbd) != -1) { +? if (nbd_aio_opt_info ( +??????? nbd, +??????? (nbd_completion_callback) { .callback = my_completion_cb } +????? ) != -1) { ???? fprintf (stderr, "expecting error for opt_info\n"); ???? exit (EXIT_FAILURE); ?? } Best regards, Tage
Apologize if resending, but I'm not sure my previous email was actually delivered. On 7/12/2023 10:33 PM, Eric Blake wrote:> On Wed, Jul 12, 2023 at 03:19:59PM +0000, Tage Johansson wrote: >> Hello, >> >> >> While writing some tests for the Rust bindings, I discovered a memory >> leak >> with Valgrind due to a completion callback not being freed. More >> specificly, >> the completion callback of `aio_opt_info()` doesn't seem to be if >> `aio_opt_info()` failes. In this case, `aio_opt_info()` was called in the >> connected state, so it should indeed fail, but it should perhaps also >> call >> the free function associated with the callback. > Can you paste the valgrind output showing the leak?The Valgrind output is very Rust specific and only shows the Rust allocations which goes into the completion callback and are lost. But if you apply the following patch (which modifies tests/opt-info.c) you shall see that the completion callback is not called. I have replaced a call to `nbd_opt_info()` with a call to `nbd_aio_opt_info()` and passed in a completion callback which just calls `exit(EXIT_FAILURE)`. So if the completion callback is called the test should fail, which it doesn't, at least not on my machine. diff --git a/tests/opt-info.c b/tests/opt-info.c index 86a0608..ba5c72b 100644 --- a/tests/opt-info.c +++ b/tests/opt-info.c @@ -30,6 +30,12 @@ ?#include <libnbd.h> +int +my_completion_cb(void *_user_data, int *_err) +{ +? exit(EXIT_FAILURE); +} + ?int ?main (int argc, char *argv[]) ?{ @@ -201,7 +207,10 @@ main (int argc, char *argv[]) ???? fprintf (stderr, "expecting export name 'good', got '%s'\n", s); ???? exit (EXIT_FAILURE); ?? } -? if (nbd_opt_info (nbd) != -1) { +? if (nbd_aio_opt_info ( +??????? nbd, +??????? (nbd_completion_callback) { .callback = my_completion_cb } +????? ) != -1) { ???? fprintf (stderr, "expecting error for opt_info\n"); ???? exit (EXIT_FAILURE); ?? } Best regards, Tage
On Thu, Jul 13, 2023 at 09:53:58AM +0000, Tage Johansson wrote:> Apologize if resending, but I'm not sure my previous email was > actually delivered. > > > On 7/12/2023 10:33 PM, Eric Blake wrote: > > > >On Wed, Jul 12, 2023 at 03:19:59PM +0000, Tage Johansson wrote: > >>Hello, > >> > >> > >>While writing some tests for the Rust bindings, I discovered a > >>memory leak > >>with Valgrind due to a completion callback not being freed. More > >>specificly, > >>the completion callback of `aio_opt_info()` doesn't seem to be if > >>`aio_opt_info()` failes. In this case, `aio_opt_info()` was called in the > >>connected state, so it should indeed fail, but it should perhaps > >>also call > >>the free function associated with the callback. > >Can you paste the valgrind output showing the leak? > > > The Valgrind output is very Rust specific and only shows the Rust > allocations which goes into the completion callback and are lost. > > > But if you apply the following patch (which modifies > tests/opt-info.c) you shall see that the completion callback is not > called. > > I have replaced a call to `nbd_opt_info()` with a call to > `nbd_aio_opt_info()` and passed in a completion callback which just > calls `exit(EXIT_FAILURE)`. So if the completion callback is called > the test should fail, which it doesn't, at least not on my machine.Isn't that OK? Only .free is required to be called. Rich.> > diff --git a/tests/opt-info.c b/tests/opt-info.c > index 86a0608..ba5c72b 100644 > --- a/tests/opt-info.c > +++ b/tests/opt-info.c > @@ -30,6 +30,12 @@ > > ?#include <libnbd.h> > > +int > +my_completion_cb(void *_user_data, int *_err) > +{ > +? exit(EXIT_FAILURE); > +} > + > ?int > ?main (int argc, char *argv[]) > ?{ > @@ -201,7 +207,10 @@ main (int argc, char *argv[]) > ???? fprintf (stderr, "expecting export name 'good', got '%s'\n", s); > ???? exit (EXIT_FAILURE); > ?? } > -? if (nbd_opt_info (nbd) != -1) { > +? if (nbd_aio_opt_info ( > +??????? nbd, > +??????? (nbd_completion_callback) { .callback = my_completion_cb } > +????? ) != -1) { > ???? fprintf (stderr, "expecting error for opt_info\n"); > ???? exit (EXIT_FAILURE); > ?? } > > Best regards, > > Tage > >-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW