Linus Torvalds
2023-May-15 15:44 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner at kernel.org> wrote:
>
> So I think we will be able to address (1) and (2) by making vhost tasks
> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> and then having vhost handle get_signal() - as you mentioned - the same
> way io uring already does. We should also remove the ignore_signals
> thing completely imho. I don't think we ever want to do this with user
> workers.

Right. That's what IO_URING does:

        if (args->io_thread) {
                /*
                 * Mark us an IO worker, and block any signal that isn't
                 * fatal or STOP
                 */
                p->flags |= PF_IO_WORKER;
                siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
        }

and I really think that vhost should basically do exactly what io_uring does.

Not because io_uring fundamentally got this right - but simply because
io_uring had almost all the same bugs (and then some), and what the
io_uring worker threads ended up doing was to basically zoom in on
"this works".

And it zoomed in on it largely by just going for "make it look as much
as possible as a real user thread", because every time the kernel
thread did something different, it just caused problems.

So I think the patch should just look something like the attached.
Mike, can you test this on whatever vhost test-suite?

I did consider getting rid of ".ignore_signals" entirely, and instead
just keying the "block signals" behavior off the ".user_worker" flag.
But this approach doesn't seem wrong either, and I don't think it's
wrong to make the create_io_thread() function say that
".ignore_signals = 1" thing explicitly, rather than key it off the
".io_thread" flag.

Jens/Christian - comments?

Slightly related to this all: I think vhost should also do
CLONE_FILES, and get rid of the whole ".no_files" thing. Again, if
vhost doesn't use any files, it shouldn't matter, and looking
different just to be different is wrong. But if vhost doesn't use any
files, the current situation shouldn't be a bug either.

              Linus
Jens Axboe
2023-May-15 15:52 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/15/23 9:44 AM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner at kernel.org> wrote:
>>
>> So I think we will be able to address (1) and (2) by making vhost tasks
>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>> and then having vhost handle get_signal() - as you mentioned - the same
>> way io uring already does. We should also remove the ignore_signals
>> thing completely imho. I don't think we ever want to do this with user
>> workers.
>
> Right. That's what IO_URING does:
>
>         if (args->io_thread) {
>                 /*
>                  * Mark us an IO worker, and block any signal that isn't
>                  * fatal or STOP
>                  */
>                 p->flags |= PF_IO_WORKER;
>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>         }
>
> and I really think that vhost should basically do exactly what io_uring does.
>
> Not because io_uring fundamentally got this right - but simply because
> io_uring had almost all the same bugs (and then some), and what the
> io_uring worker threads ended up doing was to basically zoom in on
> "this works".
>
> And it zoomed in on it largely by just going for "make it look as much
> as possible as a real user thread", because every time the kernel
> thread did something different, it just caused problems.

This is exactly what I told Christian in a private chat too - we went
through all of that, and this is what works. KISS.

> So I think the patch should just look something like the attached.
> Mike, can you test this on whatever vhost test-suite?

Seems like that didn't get attached...

> I did consider getting rid of ".ignore_signals" entirely, and instead
> just keying the "block signals" behavior off the ".user_worker" flag.
> But this approach doesn't seem wrong either, and I don't think it's
> wrong to make the create_io_thread() function say that
> ".ignore_signals = 1" thing explicitly, rather than key it off the
> ".io_thread" flag.
>
> Jens/Christian - comments?
>
> Slightly related to this all: I think vhost should also do
> CLONE_FILES, and get rid of the whole ".no_files" thing. Again, if
> vhost doesn't use any files, it shouldn't matter, and looking
> different just to be different is wrong. But if vhost doesn't use any
> files, the current situation shouldn't be a bug either.

Only potential downside is that it does make file references more
expensive for other syscalls, since you now have a shared file table.
But probably not something to worry about here?

-- 
Jens Axboe
Mike Christie
2023-May-15 22:23 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/15/23 10:44 AM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner at kernel.org> wrote:
>>
>> So I think we will be able to address (1) and (2) by making vhost tasks
>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>> and then having vhost handle get_signal() - as you mentioned - the same
>> way io uring already does. We should also remove the ignore_signals
>> thing completely imho. I don't think we ever want to do this with user
>> workers.
>
> Right. That's what IO_URING does:
>
>         if (args->io_thread) {
>                 /*
>                  * Mark us an IO worker, and block any signal that isn't
>                  * fatal or STOP
>                  */
>                 p->flags |= PF_IO_WORKER;
>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>         }
>
> and I really think that vhost should basically do exactly what io_uring does.
>
> Not because io_uring fundamentally got this right - but simply because
> io_uring had almost all the same bugs (and then some), and what the
> io_uring worker threads ended up doing was to basically zoom in on
> "this works".
>
> And it zoomed in on it largely by just going for "make it look as much
> as possible as a real user thread", because every time the kernel
> thread did something different, it just caused problems.
>
> So I think the patch should just look something like the attached.
> Mike, can you test this on whatever vhost test-suite?

I tried that approach already and it doesn't work, because io_uring and
vhost differ: vhost drivers implement a device, where each device has a
vhost_task and the driver has a file_operations for the device. When
the vhost_task's parent gets a signal like SIGKILL, it will exit and
call into the vhost driver's file_operations->release function. At that
time we need to do cleanup, like flushing the device that uses the
vhost_task. There is also the case where, if the vhost_task itself gets
a SIGKILL, we can just exit from under the vhost layer.
io_uring has a similar cleanup issue, where the core kernel code can't
do exit for the io thread, but it only has that one point to worry
about, so when it gets SIGKILL it can clean itself up and then exit.

So the patch in the other mail hits an issue where vhost_worker() can
get into a tight loop hammering the CPU due to the pending SIGKILL
signal. The vhost layer really doesn't want any signals and wants to
work like kthreads for that case.

To make it really simple, can we do something like the following, which
separates user and io worker behavior? The major diff is how they
handle signals and exit. I also included a fix for the freeze case:

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
-	u32 ignore_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
-	if (args->user_worker)
+	if (args->user_worker) {
+		/*
+		 * User workers are similar to io_threads but they do not
+		 * support signals and cleanup is driven via another kernel
+		 * interface so even SIGKILL is blocked.
+		 */
 		p->flags |= PF_USER_WORKER;
+		siginitsetinv(&p->blocked, 0);
+	}
 	if (args->io_thread) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	if (args->ignore_signals)
-		ignore_signals(p);
+	if (args->user_worker)
+		p->flags |= PF_NOFREEZE;
 
 	stackleak_task_init(p);
 
@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn		= fn,
 		.fn_arg		= arg,
 		.io_thread	= 1,
-		.user_worker	= 1,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..f2f1e5ef44b2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
 	return task_curr(p) || !task_sigpending(p);
 }
 
+static void try_set_pending_sigkill(struct task_struct *t)
+{
+	/*
+	 * User workers don't support signals and their exit is driven through
+	 * their kernel layer, so do not send them SIGKILL.
+	 */
+	if (t->flags & PF_USER_WORKER)
+		return;
+
+	sigaddset(&t->pending.signal, SIGKILL);
+	signal_wake_up(t, 1);
+}
+
 static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
 	struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 			t = p;
 			do {
 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-				sigaddset(&t->pending.signal, SIGKILL);
-				signal_wake_up(t, 1);
+				try_set_pending_sigkill(t);
 			} while_each_thread(p, t);
 			return;
 		}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
-		sigaddset(&t->pending.signal, SIGKILL);
-		signal_wake_up(t, 1);
+		try_set_pending_sigkill(t);
 	}
 
 	return count;
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2d8d3ebaec4d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+				  CLONE_THREAD | CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
 		.user_worker	= 1,
 		.no_files	= 1,
-		.ignore_signals	= 1,
 	};
 	struct vhost_task *vtsk;
 	struct task_struct *tsk;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d257916f39e5..255a2147e5c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1207,12 +1207,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	DEFINE_WAIT(wait);
 
 	/*
-	 * Do not throttle user workers, kthreads other than kswapd or
+	 * Do not throttle IO/user workers, kthreads other than kswapd or
 	 * workqueues. They may be required for reclaim to make
 	 * forward progress (e.g. journalling workqueues or kthreads).
 	 */
 	if (!current_is_kswapd() &&
-	    current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
+	    current->flags & (PF_USER_WORKER|PF_IO_WORKER|PF_KTHREAD)) {
 		cond_resched();
 		return;
 	}