Oleg Nesterov
2023-May-23 12:15 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 05/22, Oleg Nesterov wrote:
>
> Right now I think that "int dead" should die,

No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL.

> but let me think tomorrow.

May be something like this... I don't like it but I can't suggest anything better
right now.

	bool killed = false;

	for (;;) {
		...

		node = llist_del_all(&worker->work_list);
		if (!node) {
			schedule();
			/*
			 * When we get a SIGKILL our release function will
			 * be called. That will stop new IOs from being queued
			 * and check for outstanding cmd responses. It will then
			 * call vhost_task_stop to tell us to return and exit.
			 */
			if (signal_pending(current)) {
				struct ksignal ksig;

				if (!killed)
					killed = get_signal(&ksig);

				clear_thread_flag(TIF_SIGPENDING);
			}

			continue;
		}

-------------------------------------------------------------------------------
But let me ask a couple of questions. Let's forget this patch, let's look at the
current code:

	node = llist_del_all(&worker->work_list);
	if (!node)
		schedule();

	node = llist_reverse_order(node);
	... process works ...

To me this looks a bit confusing. Shouldn't we do

	if (!node) {
		schedule();
		continue;
	}

just to make the code a bit more clear? If node == NULL then
llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
But this is minor.


	/* make sure flag is seen after deletion */
	smp_wmb();
	llist_for_each_entry_safe(work, work_next, node, node) {
		clear_bit(VHOST_WORK_QUEUED, &work->flags);

I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
vhost_work_queue() can add this work again and change work->node->next.

That is why we use _safe, but we need to ensure that llist_for_each_safe()
completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.

So it seems that smp_wmb() can't help and should be removed, instead we need

	llist_for_each_entry_safe(...) {
		smp_mb__before_atomic();
		clear_bit(VHOST_WORK_QUEUED, &work->flags);

Also, if the work->fn pointer is not stable, we should read it before
smp_mb__before_atomic() as well.

No?

		__set_current_state(TASK_RUNNING);

Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
can return with current->state != RUNNING ?

		work->fn(work);

Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
before we call work->fn(). Is it "safe" to run this callback with
signal_pending() or fatal_signal_pending() ?

Finally. I never looked into drivers/vhost/ before so I don't understand
this code at all, but let me ask anyway... Can we change vhost_dev_flush()
to run the pending callbacks rather than wait for vhost_worker() ?
I guess we can't, ->mm won't be correct, but can you confirm?

Oleg.
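For context, here is a minimal sketch of how the vhost_worker() loop might look
with the reordering Oleg suggests above. The surrounding stop/exit handling is
elided, the function shape is assumed from the 6.4-rc drivers/vhost/vhost.c, and
vhost_work_fn_t is the callback type from drivers/vhost/vhost.h; this is an
untested illustration of the barrier placement, not a proposed patch:

	static int vhost_worker(void *data)
	{
		struct vhost_worker *worker = data;
		struct vhost_work *work, *work_next;
		struct llist_node *node;

		for (;;) {
			set_current_state(TASK_INTERRUPTIBLE);

			/* stop/exit handling elided */

			node = llist_del_all(&worker->work_list);
			if (!node) {
				schedule();
				continue;	/* nothing to reverse or walk */
			}

			node = llist_reverse_order(node);
			llist_for_each_entry_safe(work, work_next, node, node) {
				/* read everything we need from *work ... */
				vhost_work_fn_t fn = work->fn;

				/*
				 * ... and only then hand the slot back to
				 * vhost_work_queue(): order the loads above
				 * (including the iterator's load of
				 * work->node.next) before clear_bit().
				 */
				smp_mb__before_atomic();
				clear_bit(VHOST_WORK_QUEUED, &work->flags);

				__set_current_state(TASK_RUNNING);
				fn(work);
				if (need_resched())
					schedule();
			}
		}
		return 0;
	}

The idea is that clear_bit() is the release point: once the bit is clear,
vhost_work_queue() may immediately reuse work->node, so every load from *work
has to be ordered before it.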
Eric W. Biederman
2023-May-23 15:57 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
Oleg Nesterov <oleg at redhat.com> writes:

> On 05/22, Oleg Nesterov wrote:
>>
>> Right now I think that "int dead" should die,
>
> No, probably we shouldn't call get_signal() if we have already
> dequeued SIGKILL.

Very much agreed. It is one thing to add a patch to move do_exit out of
get_signal. It is another to keep calling get_signal after that. Nothing
tests that case, and so we get some weird behaviors.

>> but let me think tomorrow.
>
> May be something like this... I don't like it but I can't suggest anything better
> right now.
>
> 	bool killed = false;
>
> 	for (;;) {
> 		...
>
> 		node = llist_del_all(&worker->work_list);
> 		if (!node) {
> 			schedule();
> 			/*
> 			 * When we get a SIGKILL our release function will
> 			 * be called. That will stop new IOs from being queued
> 			 * and check for outstanding cmd responses. It will then
> 			 * call vhost_task_stop to tell us to return and exit.
> 			 */
> 			if (signal_pending(current)) {
> 				struct ksignal ksig;
>
> 				if (!killed)
> 					killed = get_signal(&ksig);
>
> 				clear_thread_flag(TIF_SIGPENDING);
> 			}
>
> 			continue;
> 		}

I want to point out that we need to consider not just SIGKILL, but SIGABRT
that causes a coredump, as well as the process performing an ordinary
exit(2). All of which will cause get_signal to return SIGKILL in this
context.

> -------------------------------------------------------------------------------
> But let me ask a couple of questions.

I share most of these questions.

> Let's forget this patch, let's look at the
> current code:
>
> 	node = llist_del_all(&worker->work_list);
> 	if (!node)
> 		schedule();
>
> 	node = llist_reverse_order(node);
> 	... process works ...
>
> To me this looks a bit confusing. Shouldn't we do
>
> 	if (!node) {
> 		schedule();
> 		continue;
> 	}
>
> just to make the code a bit more clear? If node == NULL then
> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
> But this is minor.
>
>
> 	/* make sure flag is seen after deletion */
> 	smp_wmb();
> 	llist_for_each_entry_safe(work, work_next, node, node) {
> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> vhost_work_queue() can add this work again and change work->node->next.
>
> That is why we use _safe, but we need to ensure that llist_for_each_safe()
> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>
> So it seems that smp_wmb() can't help and should be removed, instead we need
>
> 	llist_for_each_entry_safe(...) {
> 		smp_mb__before_atomic();
> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> Also, if the work->fn pointer is not stable, we should read it before
> smp_mb__before_atomic() as well.
>
> No?
>
> 		__set_current_state(TASK_RUNNING);
>
> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> can return with current->state != RUNNING ?
>
> 		work->fn(work);
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?
>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?

In a conversation long ago I remember hearing that vhost does not support
file descriptor passing. Which means all of the file descriptors should be
in the same process.
Looking at the vhost code, what I am seeing happening is that the
vhost_worker persists until vhost_dev_cleanup is called from one of the
vhost_???_release() functions. The release functions are only called after
the last flush function completes. See __fput if you want to trace the
details.

On one hand this all seems reasonable. On the other hand I am not seeing
the code that prevents file descriptor passing.

It is probably not the worst thing in the world, but what this means is
that if you pass a copy of the vhost file descriptor to another process,
the vhost_worker will persist, and thus the process will persist, until
that copy of the file descriptor is closed.

Eric
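For reference, "file descriptor passing" here is the usual SCM_RIGHTS
mechanism over an AF_UNIX socket. A minimal userspace sketch (hypothetical
fds, error handling omitted) of how another process can end up holding a
copy of a vhost fd, and with it a reference that keeps release() and
vhost_dev_cleanup() from running:

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/*
	 * Send a copy of `vhost_fd` (e.g. an open /dev/vhost-net) over an
	 * AF_UNIX socket. The receiver then holds its own reference to the
	 * same struct file, so __fput()/release() does not run until that
	 * copy is closed as well.
	 */
	static int send_fd(int unix_sock, int vhost_fd)
	{
		char dummy = 'x';
		struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
		union {
			struct cmsghdr hdr;
			char buf[CMSG_SPACE(sizeof(int))];
		} u = {};
		struct msghdr msg = {
			.msg_iov = &iov,
			.msg_iovlen = 1,
			.msg_control = u.buf,
			.msg_controllen = sizeof(u.buf),
		};
		struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;
		cmsg->cmsg_len = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &vhost_fd, sizeof(int));

		return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
	}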
Mike Christie
2023-May-24 00:02 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
Hey Oleg,

For all these questions below let me get back to you by tomorrow. I need to
confirm if something would be considered a regression by the core vhost
developers.

On 5/23/23 7:15 AM, Oleg Nesterov wrote:
> On 05/22, Oleg Nesterov wrote:
>>
>> Right now I think that "int dead" should die,
>
> No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL.
>
>> but let me think tomorrow.
>
> May be something like this... I don't like it but I can't suggest anything better
> right now.
>
> 	bool killed = false;
>
> 	for (;;) {
> 		...
>
> 		node = llist_del_all(&worker->work_list);
> 		if (!node) {
> 			schedule();
> 			/*
> 			 * When we get a SIGKILL our release function will
> 			 * be called. That will stop new IOs from being queued
> 			 * and check for outstanding cmd responses. It will then
> 			 * call vhost_task_stop to tell us to return and exit.
> 			 */
> 			if (signal_pending(current)) {
> 				struct ksignal ksig;
>
> 				if (!killed)
> 					killed = get_signal(&ksig);
>
> 				clear_thread_flag(TIF_SIGPENDING);
> 			}
>
> 			continue;
> 		}
>
> -------------------------------------------------------------------------------
> But let me ask a couple of questions. Let's forget this patch, let's look at the
> current code:
>
> 	node = llist_del_all(&worker->work_list);
> 	if (!node)
> 		schedule();
>
> 	node = llist_reverse_order(node);
> 	... process works ...
>
> To me this looks a bit confusing. Shouldn't we do
>
> 	if (!node) {
> 		schedule();
> 		continue;
> 	}
>
> just to make the code a bit more clear? If node == NULL then
> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
> But this is minor.
>
>
> 	/* make sure flag is seen after deletion */
> 	smp_wmb();
> 	llist_for_each_entry_safe(work, work_next, node, node) {
> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> vhost_work_queue() can add this work again and change work->node->next.
>
> That is why we use _safe, but we need to ensure that llist_for_each_safe()
> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>
> So it seems that smp_wmb() can't help and should be removed, instead we need
>
> 	llist_for_each_entry_safe(...) {
> 		smp_mb__before_atomic();
> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> Also, if the work->fn pointer is not stable, we should read it before
> smp_mb__before_atomic() as well.
>
> No?
>
> 		__set_current_state(TASK_RUNNING);
>
> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> can return with current->state != RUNNING ?
>
> 		work->fn(work);
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?
>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?
>
> Oleg.
>
Mike Christie
2023-May-25 16:15 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/23/23 7:15 AM, Oleg Nesterov wrote:
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?

The questions before this one I'll leave for the core vhost devs since
they know best.

For this question and the one below, when we get SIGKILL we think it's ok
to fail the operation, as in it's ok to not execute it like normal (send
it down to the net/target/scsi/block layers for execution). However, we
need to do some processing of the work to release mem, refcounts, etc.

For example, we use the vhost_task to handle completions from the layers
we interact with. So when we get a SIGKILL, we could have N commands in
the target/block/scsi/nvme layers below the vhost layer. When those
complete, the current code assumes we have the vhost_task to handle the
responses.

So for pending works on that work_list, we can pretty easily kill them.
However, we don't have a way to kill those outstanding requests to some
other layer, so we have to wait and handle them somewhere.

If you are saying that once we get SIGKILL then we just can't use the task
anymore, and we have to drop down to workqueue/kthread or change up the
completion paths so they can execute in the context they are completed
from by lower levels, then I can code this. On the vhost side, it's just
really ugly vs the code we have now that used to use kthreads (or in
6.4-rc looks like a process) so we couldn't get signals and just had the
one code path that removes us.

From the vhost point of view signals are not useful to us and it's just
adding complexity for something that I don't think is useful to users.
However, after discussing this with you guys for a week and going over the
signal code, I can see your point of view where you are thinking it's ugly
for the signal code to try and support what we want :)

>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?
>
> Oleg.
>
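To make the shape of the problem Mike describes concrete, here is a rough
sketch of such a completion path. The demo_* names are hypothetical and
only loosely modeled on the vhost_scsi completion flow; vhost_work_queue()
and llist_add() are the real interfaces:

	/* Hypothetical device/command types, only to show the flow. */
	struct demo_cmd {
		struct llist_node	completion_node;
		struct demo_dev		*dev;
		/* ... buffers, refcounts, status from the lower layer ... */
	};

	struct demo_dev {
		struct vhost_dev	vdev;
		struct llist_head	completion_list;
		struct vhost_work	completion_work; /* runs in vhost_worker */
	};

	/*
	 * Called from the lower layer (target/block/scsi/nvme) when an
	 * in-flight command finishes, possibly long after SIGKILL was
	 * delivered to the owner. It does not answer the guest or drop
	 * refcounts itself; it only queues work for the vhost worker. So
	 * the worker has to stay around until every in-flight command has
	 * come back through here, or this path has to be reworked to run
	 * in the completing context.
	 */
	static void demo_cmd_done(struct demo_cmd *cmd)
	{
		struct demo_dev *d = cmd->dev;

		llist_add(&cmd->completion_node, &d->completion_list);
		vhost_work_queue(&d->vdev, &d->completion_work);
	}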
Jason Wang
2023-May-31 05:22 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 2023/5/23 20:15, Oleg Nesterov wrote:
> On 05/22, Oleg Nesterov wrote:
>> Right now I think that "int dead" should die,
>
> No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL.
>
>> but let me think tomorrow.
>
> May be something like this... I don't like it but I can't suggest anything better
> right now.
>
> 	bool killed = false;
>
> 	for (;;) {
> 		...
>
> 		node = llist_del_all(&worker->work_list);
> 		if (!node) {
> 			schedule();
> 			/*
> 			 * When we get a SIGKILL our release function will
> 			 * be called. That will stop new IOs from being queued
> 			 * and check for outstanding cmd responses. It will then
> 			 * call vhost_task_stop to tell us to return and exit.
> 			 */
> 			if (signal_pending(current)) {
> 				struct ksignal ksig;
>
> 				if (!killed)
> 					killed = get_signal(&ksig);
>
> 				clear_thread_flag(TIF_SIGPENDING);
> 			}
>
> 			continue;
> 		}
>
> -------------------------------------------------------------------------------
> But let me ask a couple of questions. Let's forget this patch, let's look at the
> current code:
>
> 	node = llist_del_all(&worker->work_list);
> 	if (!node)
> 		schedule();
>
> 	node = llist_reverse_order(node);
> 	... process works ...
>
> To me this looks a bit confusing. Shouldn't we do
>
> 	if (!node) {
> 		schedule();
> 		continue;
> 	}
>
> just to make the code a bit more clear? If node == NULL then
> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
> But this is minor.

Yes.

>
> 	/* make sure flag is seen after deletion */
> 	smp_wmb();
> 	llist_for_each_entry_safe(work, work_next, node, node) {
> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> vhost_work_queue() can add this work again and change work->node->next.
>
> That is why we use _safe, but we need to ensure that llist_for_each_safe()
> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.

This should be fine since stores are not speculated, so work->node->next
needs to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop
condition.

> So it seems that smp_wmb() can't help and should be removed, instead we need
>
> 	llist_for_each_entry_safe(...) {
> 		smp_mb__before_atomic();
> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> Also, if the work->fn pointer is not stable, we should read it before
> smp_mb__before_atomic() as well.

The fn won't be changed after it is initialized.

>
> No?
>
> 		__set_current_state(TASK_RUNNING);
>
> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> can return with current->state != RUNNING ?

It is because the state was set to TASK_INTERRUPTIBLE at the beginning of
the loop; otherwise it might have side effects while executing work->fn().

>
> 		work->fn(work);
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?

It looks safe since:

1) vhost holds a refcnt of the mm
2) release will sync with the worker

>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?

Yes.

Thanks

>
> Oleg.
>
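For readers less familiar with the idiom Jason is referring to, this is the
standard prepare-to-sleep pattern. A minimal sketch with the vhost details
elided:

	for (;;) {
		/*
		 * Publish "I may sleep" before checking for work, so a
		 * concurrent wake_up_process() from vhost_work_queue()
		 * cannot be lost between the check and schedule().
		 */
		set_current_state(TASK_INTERRUPTIBLE);

		node = llist_del_all(&worker->work_list);
		if (!node) {
			schedule();		/* actually sleep */
			continue;
		}

		/*
		 * There is work: go back to TASK_RUNNING before calling
		 * work->fn(), so the callback does not run (and possibly
		 * block or return) with a stale TASK_INTERRUPTIBLE state.
		 */
		__set_current_state(TASK_RUNNING);
		...
	}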