Mike Christie
2023-May-05 22:37 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/5/23 1:22 PM, Linus Torvalds wrote:> On Fri, May 5, 2023 at 6:40?AM Nicolas Dichtel > <nicolas.dichtel at 6wind.com> wrote: >> >> Is this an intended behavior? >> This breaks some of our scripts. > > It doesn't just break your scripts (which counts as a regression), I > think it's really wrong. > > The worker threads should show up as threads of the thing that started > them, not as processes. > > So they should show up in 'ps' only when one of the "show threads" flag is set. > > But I suspect the fix is trivial: the virtio code should likely use > CLONE_THREAD for the copy_process() it does. > > It should look more like "create_io_thread()" than "copy_process()", I think. > > For example, do virtio worker threads really want their own signals > and files? That sounds wrong. create_io_thread() uses all of > > CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO > > to share much more of the context with the process it is actually run within. >For the vhost tasks and the CLONE flags: 1. I didn't use CLONE_FILES in the vhost task patches because you are right and we didn't need our own. We needed it to work like kthreads where there are no files, so I set the kernel_clone_args.no_files bit to have copy_files not do a dup or clone (task->files is NULL). 2. vhost tasks didn't use CLONE_SIGHAND, because userspace apps like qemu use signals for management operations. But, the vhost thread's worker functions assume signals are ignored like they were with kthreads. So if they were doing IO and got a signal like a SIGHUP they might return early and fail from whatever network/block function they were calling. And currently the parent like qemu handles something like a SIGSTOP by shutting everything down by calling into the vhost interface to remove the device. So similar to files I used the kernel_clone_args.ignore_signals bit so copy_process has the vhost thread have it's own signal handle that just ignores signals. 3. I didn't use CLONE_THREAD because before my patches you could do "ps -u root" and see all the vhost threads. If we use CLONE_THREAD, then we can only see it when we do something like "ps -T -p $parent" like you mentioned above. I guess I messed up and did the reverse and thought it would be a regression if "ps -u root" no longer showed the vhost threads. If it's ok to change the behavior of "ps -u root", then we can do this patch: (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps' case. If you could test the ps only case or give me info on what /usr/bin/example was doing I can replicate and test here): diff --git a/kernel/fork.c b/kernel/fork.c index ed4e01daccaa..eb9ffc58e211 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process( /* * Thread groups must share signals as well, and detached threads * can only be started up within the thread group. + * + * A userworker's parent thread will normally have a signal handler + * that performs management operations, but the worker will not + * because the parent will handle the signal then user a worker + * specific interface to manage the thread and related resources. */ - if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND)) + if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) && + !args->user_worker && !args->ignore_signals) return ERR_PTR(-EINVAL); /* diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index b7cbd66f889e..3700c21ea39d 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, const char *name) { struct kernel_clone_args args = { - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, + .flags = CLONE_FS | CLONE_THREAD | CLONE_VM | + CLONE_UNTRACED, .exit_signal = 0, .fn = vhost_task_fn, .name = name,
Linus Torvalds
2023-May-06 01:53 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Fri, May 5, 2023 at 3:38?PM Mike Christie <michael.christie at oracle.com> wrote:> > If it's ok to change the behavior of "ps -u root", then we can do this patch:I think this is the right thing to do. Making the user worker threads show up as threads with the vhost process as the parent really seems like a much better model, and more accurate. Yes, they used to show up as random kernel threads, and you'd see them as such (not just for "ps -u root", but simply also with just a normal "ps ax" kind of thing). But that isn't all that helpful, and it's really just annoying to see our kernel threads in "ps ax" output, and I've often wished we didn't do that (it think of all the random "kworker/xyz-kcryptd" etc things that show up). So I think showing them as the threaded children of the vhost process is much nicer, and probably the best option. Because I don't thin kanything is going to get the *old* behavior of showing them as the '[vhost-xyz]' system threads (or whatever the old output ended up being in 'ps ax'), but hopefully nothing wants that horror anyway. At a minimum, the parenting is fundamentally going to look different in the new model. Linus
Thorsten Leemhuis
2023-May-13 12:39 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
[CCing the regression list] On 06.05.23 00:37, Mike Christie wrote:> On 5/5/23 1:22 PM, Linus Torvalds wrote: >> On Fri, May 5, 2023 at 6:40?AM Nicolas Dichtel >> <nicolas.dichtel at 6wind.com> wrote: >>> >>> Is this an intended behavior? >>> This breaks some of our scripts.Jumping in here, as I found another problem with that patch: it broke s2idle on my laptop when a qemu-kvm VM is running, as freezing user space processes now fails for me: ``` [ 195.442949] PM: suspend entry (s2idle) [ 195.641271] Filesystems sync: 0.198 seconds [ 195.833828] Freezing user space processes [ 215.841084] Freezing user space processes failed after 20.007 seconds (1 tasks refusing to freeze, wq_busy=0): [ 215.841255] task:vhost-3221 state:R stack:0 pid:3250 ppid:3221 flags:0x00004006 [ 215.841264] Call Trace: [ 215.841266] <TASK> [ 215.841270] ? update_rq_clock+0x39/0x270 [ 215.841283] ? _raw_spin_unlock+0x19/0x40 [ 215.841290] ? __schedule+0x3f/0x1510 [ 215.841296] ? sysvec_apic_timer_interrupt+0xaf/0xd0 [ 215.841306] ? schedule+0x61/0xe0 [ 215.841313] ? vhost_worker+0x87/0xb0 [vhost] [ 215.841329] ? vhost_task_fn+0x1a/0x30 [ 215.841336] ? __pfx_vhost_task_fn+0x10/0x10 [ 215.841341] ? ret_from_fork+0x2c/0x50 [ 215.841352] </TASK> [ 215.841936] OOM killer enabled. [ 215.841938] Restarting tasks ... done. [ 215.844204] random: crng reseeded on system resumption [ 215.957095] PM: suspend exit [ 215.957185] PM: suspend entry (s2idle) [ 215.967646] Filesystems sync: 0.010 seconds [ 215.971326] Freezing user space processes [ 235.974400] Freezing user space processes failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0): [ 235.974574] task:vhost-3221 state:R stack:0 pid:3250 ppid:3221 flags:0x00004806 [ 235.974583] Call Trace: [ 235.974586] <TASK> [ 235.974593] ? __schedule+0x184/0x1510 [ 235.974605] ? sysvec_apic_timer_interrupt+0xaf/0xd0 [ 235.974616] ? schedule+0x61/0xe0 [ 235.974624] ? vhost_worker+0x87/0xb0 [vhost] [ 235.974648] ? vhost_task_fn+0x1a/0x30 [ 235.974656] ? __pfx_vhost_task_fn+0x10/0x10 [ 235.974662] ? ret_from_fork+0x2c/0x50 [ 235.974673] </TASK> [ 235.975190] OOM killer enabled. [ 235.975192] Restarting tasks ... done. [ 235.978131] random: crng reseeded on system resumption [ 236.091219] PM: suspend exit ``` After running into the problem I booted 6.3.1-rc1 again and there s2idle still worked. Didn't do a bisection, just looked at the vhost commits during the latest merge window; 6e890c5d502 ("vhost: use vhost_tasks for worker threads") looked suspicious, so I reverted it on top of latest mainline and then things work again. Through a search on lore I arrived in this thread and found below patch from Mike. Gave it a try on top of latest mainline, but it didn't help. Ciao, Thorsten> [...] > If it's ok to change the behavior of "ps -u root", then we can do this patch: > (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps' > case. If you could test the ps only case or give me info on what /usr/bin/example > was doing I can replicate and test here): > > > diff --git a/kernel/fork.c b/kernel/fork.c > index ed4e01daccaa..eb9ffc58e211 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process( > /* > * Thread groups must share signals as well, and detached threads > * can only be started up within the thread group. > + * > + * A userworker's parent thread will normally have a signal handler > + * that performs management operations, but the worker will not > + * because the parent will handle the signal then user a worker > + * specific interface to manage the thread and related resources. > */ > - if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND)) > + if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) && > + !args->user_worker && !args->ignore_signals) > return ERR_PTR(-EINVAL); > > /* > diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c > index b7cbd66f889e..3700c21ea39d 100644 > --- a/kernel/vhost_task.c > +++ b/kernel/vhost_task.c > @@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, > const char *name) > { > struct kernel_clone_args args = { > - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, > + .flags = CLONE_FS | CLONE_THREAD | CLONE_VM | > + CLONE_UNTRACED, > .exit_signal = 0, > .fn = vhost_task_fn, > .name = name