thr3ads.net - Virtualization - [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec

If this information is useful, please help other people find it:
Share via:

Eric W. Biederman

2023-May-18 18:28 UTC

[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

Oleg Nesterov <oleg at redhat.com> writes:
> On 05/18, Mike Christie wrote:
>>
>> On 5/18/23 11:25 AM, Oleg Nesterov wrote:
>> > I too do not understand the 1st change in this patch ...
>> >
>> > On 05/18, Mike Christie wrote:
>> >>
>> >> In the other patches we do:
>> >>
>> >> if (get_signal(ksig))
>> >> 	start_exit_cleanup_by_stopping_newIO()
>> >> 	flush running IO()
>> >> 	exit()
>> >>
>> >> But to do the flush running IO() part of this I need to wait
for it so
>> >> that's why I wanted to be able to dequeue the SIGKILL and
clear the
>> >> TIF_SIGPENDING bit.
>> >
>> > But get_signal() will do what you need, dequeue SIGKILL and clear
SIGPENDING ?
>> >
>> > 	if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>> > 	     signal->group_exec_task) {
>> > 		clear_siginfo(&ksig->info);
>> > 		ksig->info.si_signo = signr = SIGKILL;
>> > 		sigdelset(&current->pending.signal, SIGKILL);
>> >
>> > this "dequeues" SIGKILL,
>
> OOPS. this doesn't remove SIGKILL from
current->signal->shared_pending
Neither does calling get_signal the first time.
But the second time get_signal is called it should work.

Leaving SIGKILL in current->signal->shared_pending when it has already
been short circuit delivered appears to be an out and out bug.
>> >
>> > 		trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>> > 			&sighand->action[SIGKILL - 1]);
>> > 		recalc_sigpending();
>> >
>> > this clears TIF_SIGPENDING.
>
> No, I was wrong, recalc_sigpending() won't clear TIF_SIGPENDING if
> SIGKILL is in signal->shared_pending
That feels wrong as well.
>> I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
>> I'll dig into why.
>
> See above, sorry for confusion.
>
>
>
> And again, there is another problem with SIGSTOP. To simplify, suppose
> a PF_IO_WORKER thread does something like
>
> 	while (signal_pending(current))
> 		get_signal(...);
>
> this will loop forever if (SIGNAL_GROUP_EXIT || group_exec_task) and
> SIGSTOP is pending.
I think we want to do something like the untested diff below.

That the PF_IO_WORKER test allows get_signal to be called
after get_signal returns a fatal aka SIGKILL seems wrong.
That doesn't happen in the io_uring case, and certainly nowhere
else.

The change to complete_signal appears obviously correct although
a pending siginfo still needs to be handled.

The change to recalc_siginfo also appears mostly right, but I am not
certain that the !freezing test is in the proper place.  Nor am I
certain it won't have other surprise effects.

Still the big issue seems to be the way get_signal is connected into
these threads so that it keeps getting called.  Calling get_signal after
a fatal signal has been returned happens nowhere else and even if we fix
it today it is likely to lead to bugs in the future because whoever is
testing and updating the code is unlikely they have a vhost test case
the care about.

diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..4d54718cad36 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
 
 void recalc_sigpending(void)
 {
-       if (!recalc_sigpending_tsk(current) && !freezing(current))
+       if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
+           ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
+                   !__fatal_signal_pending(current)))
                clear_thread_flag(TIF_SIGPENDING);
 
 }
@@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct
*p, enum pid_type type)
                 * This signal will be fatal to the whole group.
                 */
                if (!sig_kernel_coredump(sig)) {
+                       /*
+                        * The signal is being short circuit delivered
+                        * don't it pending.
+                        */
+                       if (type != PIDTYPE_PID) {
+                               sigdelset(&t->signal->shared_pending, 
sig);
+
                        /*
                         * Start a group exit and wake everybody up.
                         * This way we don't have other threads



Eric

Mike Christie

2023-May-18 22:57 UTC

head link

[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

On 5/18/23 1:28 PM, Eric W. Biederman wrote:> Still the big issue seems to be the way get_signal is connected into
> these threads so that it keeps getting called.  Calling get_signal after
> a fatal signal has been returned happens nowhere else and even if we fix
> it today it is likely to lead to bugs in the future because whoever is
> testing and updating the code is unlikely they have a vhost test case
> the care about.
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..4d54718cad36 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>  
>  void recalc_sigpending(void)
>  {
> -       if (!recalc_sigpending_tsk(current) && !freezing(current))
> +       if ((!recalc_sigpending_tsk(current) && !freezing(current))
||
> +           ((current->signal->flags & SIGNAL_GROUP_EXIT)
&&
> +                   !__fatal_signal_pending(current)))
>                 clear_thread_flag(TIF_SIGPENDING);
>  
>  }
> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct
task_struct *p, enum pid_type type)
>                  * This signal will be fatal to the whole group.
>                  */
>                 if (!sig_kernel_coredump(sig)) {
> +                       /*
> +                        * The signal is being short circuit delivered
> +                        * don't it pending.
> +                        */
> +                       if (type != PIDTYPE_PID) {
> +                              
sigdelset(&t->signal->shared_pending,  sig);
> +
>                         /*
>                          * Start a group exit and wake everybody up.
>                          * This way we don't have other threads
> 
If I change up your patch so the last part is moved down a bit to when we set t
like this:

diff --git a/kernel/signal.c b/kernel/signal.c
index 0ac48c96ab04..c976a80650db 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
 
 void recalc_sigpending(void)
 {
-	if (!recalc_sigpending_tsk(current) && !freezing(current))
+	if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
+	    ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
+	     !__fatal_signal_pending(current)))
 		clear_thread_flag(TIF_SIGPENDING);
-
 }
 EXPORT_SYMBOL(recalc_sigpending);
 
@@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct
*p, enum pid_type type)
 			signal->group_exit_code = sig;
 			signal->group_stop_count = 0;
 			t = p;
+			/*
+			 * The signal is being short circuit delivered
+			 * don't it pending.
+			 */
+			if (type != PIDTYPE_PID) {
+				struct sigpending *pending;
+
+				pending = &t->signal->shared_pending;
+				sigdelset(&pending->signal, sig);
+			}
+
 			do {
 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
 				sigaddset(&t->pending.signal, SIGKILL);


Then get_signal() works like how Oleg mentioned it should earlier.

For vhost I just need the code below which is just Linus's patch plus a call
to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER
change.

Note that when we get SIGKILL, the vhost file_operations->release function is
called via

            do_exit -> exit_files -> put_files_struct -> close_files

and so the vhost release function starts to flush IO and stop the worker/vhost
task. In vhost_worker() then we just handle those last completions for already
running IO. When  the vhost release function detects they are done it does
vhost_task_stop() and vhost_worker() returns and then vhost_task_fn() does
do_exit().
So we don't return immediately when get_signal() returns non-zero.

So it works, but it sounds like you don't like vhost relying on the
behavior,
and it's non standard to use get_signal() like we are. So I'm not sure
how we
want to proceed.

Maybe the safest is to revert:

commit 6e890c5d5021ca7e69bbe203fde42447874d9a82
Author: Mike Christie <michael.christie at oracle.com>
Date:   Fri Mar 10 16:03:32 2023 -0600

    vhost: use vhost_tasks for worker threads

and retry this for the next kernel when we can do proper testing and more
code review?


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost_worker(void *data)
 		}
 
 		node = llist_del_all(&worker->work_list);
-		if (!node)
+		if (!node) {
 			schedule();
+			/*
+			 * When we get a SIGKILL our release function will
+			 * be called. That will stop new IOs from being queued
+			 * and check for outstanding cmd responses. It will then
+			 * call vhost_task_stop to exit us.
+			 */
+			vhost_task_get_signal();
+		}
 
 		node = llist_reverse_order(node);
 		/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
-	u32 ignore_signals:1;
+	u32 block_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..79bf0ed4ded0 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void
*arg,
 void vhost_task_start(struct vhost_task *vtsk);
 void vhost_task_stop(struct vhost_task *vtsk);
 bool vhost_task_should_stop(struct vhost_task *vtsk);
+void vhost_task_get_signal(void);
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
 		p->flags |= PF_KTHREAD;
 	if (args->user_worker)
 		p->flags |= PF_USER_WORKER;
-	if (args->io_thread) {
-		/*
-		 * Mark us an IO worker, and block any signal that isn't
-		 * fatal or STOP
-		 */
+	if (args->io_thread)
 		p->flags |= PF_IO_WORKER;
+	if (args->block_signals)
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
-	}
 
 	if (args->name)
 		strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	if (args->ignore_signals)
-		ignore_signals(p);
-
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *),
void *arg, int node)
 		.fn_arg		= arg,
 		.io_thread	= 1,
 		.user_worker	= 1,
+		.block_signals	= 1,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..0ac48c96ab04 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2861,11 +2861,11 @@ bool get_signal(struct ksignal *ksig)
 		}
 
 		/*
-		 * PF_IO_WORKER threads will catch and exit on fatal signals
+		 * PF_USER_WORKER threads will catch and exit on fatal signals
 		 * themselves. They have cleanup that must be performed, so
 		 * we cannot call do_exit() on their behalf.
 		 */
-		if (current->flags & PF_IO_WORKER)
+		if (current->flags & PF_USER_WORKER)
 			goto out;
 
 		/*
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..82467f450f0d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -31,22 +31,13 @@ static int vhost_task_fn(void *data)
  */
 void vhost_task_stop(struct vhost_task *vtsk)
 {
-	pid_t pid = vtsk->task->pid;
-
 	set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
 	wake_up_process(vtsk->task);
 	/*
 	 * Make sure vhost_task_fn is no longer accessing the vhost_task before
-	 * freeing it below. If userspace crashed or exited without closing,
-	 * then the vhost_task->task could already be marked dead so
-	 * kernel_wait will return early.
+	 * freeing it below.
 	 */
 	wait_for_completion(&vtsk->exited);
-	/*
-	 * If we are just closing/removing a device and the parent process is
-	 * not exiting then reap the task.
-	 */
-	kernel_wait4(pid, NULL, __WCLONE, NULL);
 	kfree(vtsk);
 }
 EXPORT_SYMBOL_GPL(vhost_task_stop);
@@ -61,6 +52,25 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
 }
 EXPORT_SYMBOL_GPL(vhost_task_should_stop);
 
+/**
+ * vhost_task_get_signal - Check if there are pending signals
+ *
+ * This checks if there are signals and will handle freezes requests. For
+ * SIGKILL, out file_operations->release is already being called when we
+ * see the signal, so we let release call vhost_task_stop to tell the
+ * vhost_task to exit when it's done using the task.
+ */
+void vhost_task_get_signal(void)
+{
+	struct ksignal ksig;
+
+	if (!signal_pending(current))
+		return;
+
+	get_signal(&ksig);
+}
+EXPORT_SYMBOL_GPL(vhost_task_get_signal);
+
 /**
  * vhost_task_create - create a copy of a process to be used by the kernel
  * @fn: thread stack
@@ -75,13 +85,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void
*arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+				  CLONE_THREAD | CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
 		.user_worker	= 1,
 		.no_files	= 1,
-		.ignore_signals	= 1,
+		.block_signals	= 1,
 	};
 	struct vhost_task *vtsk;
 	struct task_struct *tsk;

Oleg Nesterov

2023-May-22 13:30 UTC

head link

[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

On 05/18, Eric W. Biederman wrote:>
>  void recalc_sigpending(void)
>  {
> -       if (!recalc_sigpending_tsk(current) && !freezing(current))
> +       if ((!recalc_sigpending_tsk(current) && !freezing(current))
||
> +           ((current->signal->flags & SIGNAL_GROUP_EXIT)
&&
> +                   !__fatal_signal_pending(current)))
>                 clear_thread_flag(TIF_SIGPENDING);
>  
>  }
> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct
task_struct *p, enum pid_type type)
>                  * This signal will be fatal to the whole group.
>                  */
>                 if (!sig_kernel_coredump(sig)) {
> +                       /*
> +                        * The signal is being short circuit delivered
> +                        * don't it pending.
> +                        */
> +                       if (type != PIDTYPE_PID) {
> +                              
sigdelset(&t->signal->shared_pending,  sig);
> +
>                         /*
>                          * Start a group exit and wake everybody up.
>                          * This way we don't have other threads
Eric, sorry. I fail to understand this patch.

How can it help? And whom?

Perhaps we can discuss it in the context of the new series from Mike?

Oleg.

Virtualization - May 2023 - [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set