Linus Torvalds
2023-May-27 16:12 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On Sat, May 27, 2023 at 2:49 AM Eric W. Biederman <ebiederm at xmission.com> wrote:
>
> The real sticky widget for me is how to handle one of these processes
> coredumping.  It really looks like it will result in a reliable hang.

Well, if *that* is the main worry, I think that's trivial enough to deal with.

In particular, we could make the rule just be that user worker threads
simply do not participate in core-dumps.

THAT isn't hard.

All we need to do is

 (a) not count those threads in zap_threads()

 (b) make sure that they don't add themselves to the "dumper" list by
     not calling "coredump_task_exit()"

 (c) not initiate core-dumping themselves.

and I think that's pretty much it.

In fact, that really seems like a good model *regardless*, because
honestly, a PF_IO_WORKER doesn't have valid register state for the
core dump anyway, and anything that would have caused an IO thread to
get a SIGSEGV *should* have caused a kernel oops already.

So the only worry is that the core dump will now happen while an IO
worker is still busy and so it's not "atomic" wrt possible VM changes,
but while that used to be a big problem back in the dark ages when we
didn't get the VM locks for core dumping, that got fixed a few years
ago because it already caused lots of potential issues.

End result: I think the attached patch is probably missing something,
but the approach "FeelsRight(tm)" to me.

Comments?

             Linus

-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.diff
Type: text/x-patch
Size: 2133 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20230527/9dcb4dad/attachment-0001.bin>
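Since patch.diff itself was scrubbed by the archive, the following is a
rough C sketch of what steps (a) and (b) come down to, reconstructed
from the hunks posted and quoted in Eric's reply below; it illustrates
the idea rather than reproducing the attachment:

	/* (a) in fs/coredump.c zap_process(): still SIGKILL and wake
	 *     the worker thread, but do not count it among the threads
	 *     the dump leader must wait for. */
	if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
		sigaddset(&t->pending.signal, SIGKILL);
		signal_wake_up(t, 1);
		if (!(t->flags & PF_IO_WORKER))
			nr++;
	}

	/* (b) in kernel/exit.c coredump_task_exit(): a worker never adds
	 *     itself to the dumper list, so the dump cannot hang on it. */
	if (core_state && !(tsk->flags & PF_IO_WORKER)) {
		struct core_thread self;

		self.task = current;
		/* ... join core_state->dumper and wait ... */
	}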
Eric W. Biederman
2023-May-28 01:17 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
Linus Torvalds <torvalds at linux-foundation.org> writes:

> On Sat, May 27, 2023 at 2:49 AM Eric W. Biederman <ebiederm at xmission.com> wrote:
>>
>> The real sticky widget for me is how to handle one of these processes
>> coredumping.  It really looks like it will result in a reliable hang.
>
> Well, if *that* is the main worry, I think that's trivial enough to deal with.
>
> In particular, we could make the rule just be that user worker threads
> simply do not participate in core-dumps.
>
> THAT isn't hard.
>
> All we need to do is
>
>  (a) not count those threads in zap_threads()
>
>  (b) make sure that they don't add themselves to the "dumper" list by
>      not calling "coredump_task_exit()"
>
>  (c) not initiate core-dumping themselves.
>
> and I think that's pretty much it.
>
> In fact, that really seems like a good model *regardless*, because
> honestly, a PF_IO_WORKER doesn't have valid register state for the
> core dump anyway, and anything that would have caused an IO thread to
> get a SIGSEGV *should* have caused a kernel oops already.
>
> So the only worry is that the core dump will now happen while an IO
> worker is still busy and so it's not "atomic" wrt possible VM changes,
> but while that used to be a big problem back in the dark ages when we
> didn't get the VM locks for core dumping, that got fixed a few years
> ago because it already caused lots of potential issues.
>
> End result: I think the attached patch is probably missing something,
> but the approach "FeelsRight(tm)" to me.
>
> Comments?

It seems like a good approach for including in the -rc series.
I think the change should look more like my change below.

nits:
- The threads all need to participate in the group exit even if they
  aren't going to be in the coredump; in the fs/coredump.c hunk below
  the SIGKILL and wakeup still happen, only the dumper count skips
  the worker threads.
- For vhost_worker we need s/PF_IO_WORKER/PF_USER_WORKER/.
- Moving PF_IO_WORKER above the sig_kernel_coredump(signr) test is
  unnecessary.  The sig_kernel_coredump(signr) test can only become
  true if a process exit has not been initiated yet.  More importantly,
  moving the test obscures the fact that only do_group_exit is moved
  out of get_signal for the PF_IO_WORKER special case.

Long term I think we want an approach that stops the worker threads
during the coredumps.  It will just yield a better quality of
implementation if we minimize the amount of concurrency during the
coredump.

I have a pending patchset that moves the coredump rendezvous into
get_signal.  At that point stopping all of the threads is just like
SIGSTOP: something the worker threads can use, and it won't introduce
any issues.  Today the problem is that some of the worker thread code
must run before the coredump stop.

Looking forward, I don't see not asking the worker threads to stop for
the coredump right now causing any problems in the future.  So I think
we can use this to resolve the coredump issue I spotted.
diff --git a/fs/coredump.c b/fs/coredump.c
index ece7badf701b..620f7f9dc894 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -371,7 +371,8 @@ static int zap_process(struct task_struct *start, int exit_code)
 		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
 			sigaddset(&t->pending.signal, SIGKILL);
 			signal_wake_up(t, 1);
-			nr++;
+			if (!(t->flags & PF_IO_WORKER))
+				nr++;
 		}
 	}

diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..6082dba9131a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -411,7 +411,9 @@ static void coredump_task_exit(struct task_struct *tsk)
 	tsk->flags |= PF_POSTCOREDUMP;
 	core_state = tsk->signal->core_state;
 	spin_unlock_irq(&tsk->sighand->siglock);
-	if (core_state) {
+
+	/* I/O Workers don't participate in coredumps */
+	if (core_state && !(tsk->flags & PF_IO_WORKER)) {
 		struct core_thread self;

 		self.task = current;

>  	current->flags |= PF_SIGNALED;
>
> +	/*
> +	 * PF_IO_WORKER threads will catch and exit on fatal signals
> +	 * themselves and do not participate in core dumping.
> +	 *
> +	 * They have cleanup that must be performed, so we cannot
> +	 * call do_exit() on their behalf.
> +	 */
> +	if (current->flags & PF_IO_WORKER)
> +		goto out;
> +
>  	if (sig_kernel_coredump(signr)) {
>  		if (print_fatal_signals)
>  			print_fatal_signal(ksig->info.si_signo);
> @@ -2860,14 +2870,6 @@ bool get_signal(struct ksignal *ksig)
>  			do_coredump(&ksig->info);
>  	}
>
> -	/*
> -	 * PF_IO_WORKER threads will catch and exit on fatal signals
> -	 * themselves. They have cleanup that must be performed, so
> -	 * we cannot call do_exit() on their behalf.
> -	 */
> -	if (current->flags & PF_IO_WORKER)
> -		goto out;
> -
>  	/*
>  	 * Death signals, no core dump.
>  	 */

Eric