Linus Torvalds
2023-May-15 22:54 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Mon, May 15, 2023 at 3:23?PM Mike Christie <michael.christie at oracle.com> wrote:> > The vhost layer really doesn't want any signals and wants to work like kthreads > for that case. To make it really simple can we do something like this where it > separates user and io worker behavior where the major diff is how they handle > signals and exit. I also included a fix for the freeze case:I don't love the SIGKILL special case, but I also don't find this deeply offensive. So if this is what it takes, I'm ok with it. I wonder if we could make that special case simply check for "is SIGKILL blocked" instead? No normal case will cause that, and it means that a PF_USER_WORKER thread could decide per-thread what it wants to do wrt SIGKILL. Christian? And I guess we should Cc: Oleg too, since the signal parts are an area he's familiar with and has worked on.. Eric Biederman has already been on the list and has also been involved Oleg: see https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d496 at oracle.com/ for the context here. Linus
Mike Christie
2023-May-16 03:53 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/15/23 5:54 PM, Linus Torvalds wrote:> On Mon, May 15, 2023 at 3:23?PM Mike Christie > <michael.christie at oracle.com> wrote: >> >> The vhost layer really doesn't want any signals and wants to work like kthreads >> for that case. To make it really simple can we do something like this where it >> separates user and io worker behavior where the major diff is how they handle >> signals and exit. I also included a fix for the freeze case: > > I don't love the SIGKILL special case, but I also don't find this > deeply offensive. So if this is what it takes, I'm ok with it. > > I wonder if we could make that special case simply check for "is > SIGKILL blocked" instead? No normal case will cause that, and it meansYeah, it's doable. Updated below.> that a PF_USER_WORKER thread could decide per-thread what it wants to > do wrt SIGKILL. > > Christian? And I guess we should Cc: Oleg too, since the signal parts > are an area he's familiar with and has worked on.. Eric Biederman has > already been on the list and has also been involved > > Oleg: see > > https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d496 at oracle.com/ > > for the context here.Oleg and Christian, Below is an updated patch that doesn't check for PF_USER_WORKER in the signal.c code and instead will check for if we have blocked the signal. diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 537cbf9a2ade..e0f5ac90a228 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -29,7 +29,6 @@ struct kernel_clone_args { u32 io_thread:1; u32 user_worker:1; u32 no_files:1; - u32 ignore_signals:1; unsigned long stack; unsigned long stack_size; unsigned long tls; diff --git a/kernel/fork.c b/kernel/fork.c index ed4e01daccaa..fd2970b598b2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process( p->flags &= ~PF_KTHREAD; if (args->kthread) p->flags |= PF_KTHREAD; - if (args->user_worker) + if (args->user_worker) { + /* + * User worker are similar to io_threads but they do not + * support signals and cleanup is driven via another kernel + * interface so even SIGKILL is blocked. + */ p->flags |= PF_USER_WORKER; + siginitsetinv(&p->blocked, 0); + } if (args->io_thread) { /* * Mark us an IO worker, and block any signal that isn't @@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_io; - if (args->ignore_signals) - ignore_signals(p); + if (args->user_worker) + p->flags |= PF_NOFREEZE; stackleak_task_init(p); @@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) .fn = fn, .fn_arg = arg, .io_thread = 1, - .user_worker = 1, }; return copy_process(NULL, 0, node, &args); diff --git a/kernel/signal.c b/kernel/signal.c index 8f6330f0e9ca..bc7e26072437 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p) return task_curr(p) || !task_sigpending(p); } +static void try_set_pending_sigkill(struct task_struct *t) +{ + /* + * User workers don't support signals and their exit is driven through + * their kernel layer, so by default block even SIGKILL. + */ + if (sigismember(&t->blocked, SIGKILL)) + return; + + sigaddset(&t->pending.signal, SIGKILL); + signal_wake_up(t, 1); +} + static void complete_signal(int sig, struct task_struct *p, enum pid_type type) { struct signal_struct *signal = p->signal; @@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type) t = p; do { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - sigaddset(&t->pending.signal, SIGKILL); - signal_wake_up(t, 1); + try_set_pending_sigkill(t); } while_each_thread(p, t); return; } @@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p) /* Don't bother with already dead threads */ if (t->exit_state) continue; - sigaddset(&t->pending.signal, SIGKILL); - signal_wake_up(t, 1); + try_set_pending_sigkill(t); } return count; diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index b7cbd66f889e..2d8d3ebaec4d 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, const char *name) { struct kernel_clone_args args = { - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | + CLONE_THREAD | CLONE_SIGHAND, .exit_signal = 0, .fn = vhost_task_fn, .name = name, .user_worker = 1, .no_files = 1, - .ignore_signals = 1, }; struct vhost_task *vtsk; struct task_struct *tsk; diff --git a/mm/vmscan.c b/mm/vmscan.c index d257916f39e5..255a2147e5c1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1207,12 +1207,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason) DEFINE_WAIT(wait); /* - * Do not throttle user workers, kthreads other than kswapd or + * Do not throttle IO/user workers, kthreads other than kswapd or * workqueues. They may be required for reclaim to make * forward progress (e.g. journalling workqueues or kthreads). */ if (!current_is_kswapd() && - current->flags & (PF_USER_WORKER|PF_KTHREAD)) { + current->flags & (PF_USER_WORKER|PF_IO_WORKER|PF_KTHREAD)) { cond_resched(); return; }
Eric W. Biederman
2023-May-16 15:56 UTC
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
Linus Torvalds <torvalds at linux-foundation.org> writes:> On Mon, May 15, 2023 at 3:23?PM Mike Christie > <michael.christie at oracle.com> wrote: >> >> The vhost layer really doesn't want any signals and wants to work like kthreads >> for that case. To make it really simple can we do something like this where it >> separates user and io worker behavior where the major diff is how they handle >> signals and exit. I also included a fix for the freeze case: > > I don't love the SIGKILL special case, but I also don't find this > deeply offensive. So if this is what it takes, I'm ok with it. > > I wonder if we could make that special case simply check for "is > SIGKILL blocked" instead? No normal case will cause that, and it means > that a PF_USER_WORKER thread could decide per-thread what it wants to > do wrt SIGKILL.A kernel thread can block SIGKILL and that is supported. For a thread that is part of a process you can't block SIGKILL when the task is part of a user mode process. There is this bit in complete_signal when SIGKILL is delivered to any thread in the process. /* * Start a group exit and wake everybody up. * This way we don't have other threads * running and doing things after a slower * thread has the fatal signal pending. */ signal->flags = SIGNAL_GROUP_EXIT; signal->group_exit_code = sig; signal->group_stop_count = 0; t = p; do { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); } while_each_thread(p, t); For clarity that sigaddset(&t->pending.signal, SIGKILL); Really isn't setting SIGKILL pending, it is part of the short circuit delivery logic, and that sigaddset(SIGKILL) is just setting a flag to tell the process it needs to die. The important part of that code is that SIGNAL_GROUP_EXIT gets set. That indicates the entire process is being torn down. Where this becomes important is exit_notify and release_task work together to ensure that the first thread in the process (a user space thread that can not block SIGKILL) will not send SIGCHLD to it's parent process until every thread in the process has exited. The delay_group_leader logic in wait_consider_task part of wait(2) has the same logic. Having been through this with io_uring the threads really need to call get_signal to handle that case. This is pretty much why I said at the outset you they needed to decided if they were going to implement a thread or if they were going to be a process. Changing the decision to be a thread from a process is fine but in that case the vhost logic needs to act like a process, just like io_uring does.> Christian? And I guess we should Cc: Oleg too, since the signal parts > are an area he's familiar with and has worked on.. Eric Biederman has > already been on the list and has also been involved> > Oleg: see > > https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d496 at oracle.com/ > > for the context here.Eric