Linus Torvalds
2023-May-25  16:20 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On Thu, May 25, 2023 at 8:30?AM Eric W. Biederman <ebiederm at xmission.com> wrote:> > Basically with no patches that is where Linus's kernel is. > > User complained about the new thread showing up in ps.Well, not only that, but it actively broke existing workflows for managing things. Showing up in 'ps' wasn't just some purely cosmetic issue, but had semantic meaning. And honestly, I think the core issue is that we should just make this work. Kernel threads randomly switching to user memory threads was wrong, so CLONE_VM is absolutely the right thing to do. But while "CLONE_VM without real threading" is a very traditional thing in Linux - it was the original model for clone(), after all - I don't believe it is the *correct* model. There was a very real reason clone() has grown CLONE_THREAD and friends. So honestly, I really think we want to complete the vhost move to CLONE_THREAD (and thus CLONE_SIGNAL). Not because the old kthread model didn't _work_, but because it's really really wrong to try to randomly take on user-space attributes at run-time. And once you do the "user threads in kernel space" model, at that point you really do want to act like a proper thread. Both because of that 'ps' issue (which is really just "show the world what your relationship is), but simply because that is the modern threading model that we use for everything else, and special cases are bad. So I'd really like to finish this. Even if we end up with a hack or two in signal handling that we can hopefully fix up later by having vhost fix up some of its current assumptions. It has worked wonderfully well for io_uring - but we *did* have quite a bit of conversion patches over some time as people found issues. Which is why I don't expect the vhost conevrsion to be entirely pain-free either, and I don't think we necessarily have to get to a "perfect and clean" state immediately, just a "working and conceptually in the right direction" state. Linus
Eric W. Biederman
2023-May-27  09:49 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
Linus Torvalds <torvalds at linux-foundation.org> writes:> So I'd really like to finish this. Even if we end up with a hack or > two in signal handling that we can hopefully fix up later by having > vhost fix up some of its current assumptions.The real sticky widget for me is how to handle one of these processes coredumping. It really looks like it will result in a reliable hang. Limiting ourselves to changes that will only affect vhost, all I can see would be allowing the vhost_worker thread to exit as soon as get_signal reports the process is exiting. Then vhost_dev_flush would need to process the pending work. Something like this: diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..fb5ebc50c553 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -234,14 +234,31 @@ EXPORT_SYMBOL_GPL(vhost_poll_stop); void vhost_dev_flush(struct vhost_dev *dev) { struct vhost_flush_struct flush; + struct vhost_worker *worker = dev->worker; + struct llist_node *node, *head; + + if (!worker) + return; + + init_completion(&flush.wait_event); + vhost_work_init(&flush.work, vhost_flush_work); - if (dev->worker) { - init_completion(&flush.wait_event); - vhost_work_init(&flush.work, vhost_flush_work); + vhost_work_queue(dev, &flush.work); - vhost_work_queue(dev, &flush.work); - wait_for_completion(&flush.wait_event); + /* Either vhost_worker runs the pending work or we do */ + node = llist_del_all(&worker->work_list); + if (node) { + node = llist_reverse_order(node); + /* make sure flag is seen after deletion */ + smp_wmb(); + llist_for_each_entry_safe(work, work_next, node, node) { + clear_bit(VHOST_WORK_QUEUED, &work->flags); + work->fn(work); + cond_resched(); + } } + + wait_for_completion(&flush.wait_event); } EXPORT_SYMBOL_GPL(vhost_dev_flush); @@ -338,6 +355,7 @@ static int vhost_worker(void *data) struct vhost_worker *worker = data; struct vhost_work *work, *work_next; struct llist_node *node; + struct ksignal ksig; for (;;) { /* mb paired w/ kthread_stop */ @@ -348,6 +366,9 @@ static int vhost_worker(void *data) break; } + if (get_signal(&ksig)) + break; + node = llist_del_all(&worker->work_list); if (!node) schedule(); diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index b7cbd66f889e..613d52f01c07 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -47,6 +47,7 @@ void vhost_task_stop(struct vhost_task *vtsk) * not exiting then reap the task. */ kernel_wait4(pid, NULL, __WCLONE, NULL); + put_task_struct(vtsk->task); kfree(vtsk); } EXPORT_SYMBOL_GPL(vhost_task_stop); @@ -101,7 +102,7 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, return NULL; } - vtsk->task = tsk; + vtsk->task = get_task_struct(tsk); return vtsk; } EXPORT_SYMBOL_GPL(vhost_task_create); Eric