michael.christie at oracle.com
2023-Jun-04 03:28 UTC
[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 6/2/23 11:15 PM, Eric W. Biederman wrote:
> This fixes the ordering issue in vhost_task_fn so that get_signal
> should not work.
>
> This patch is a gamble that during process exit or de_thread in exec
> work will not be commonly queued from other threads.
>
> If this gamble turns out to be false the existing WARN_ON in
> vhost_worker_free will fire.
>
> Can folks test this and let us know if the WARN_ON fires?

I don't hit the WARN_ONs, but probably not for the reason you are thinking of. We are hung like:

Jun 03 22:25:23 ol4 kernel: Call Trace:
Jun 03 22:25:23 ol4 kernel: <TASK>
Jun 03 22:25:23 ol4 kernel: __schedule+0x334/0xac0
Jun 03 22:25:23 ol4 kernel: ? wait_for_completion+0x86/0x150
Jun 03 22:25:23 ol4 kernel: schedule+0x5a/0xd0
Jun 03 22:25:23 ol4 kernel: schedule_timeout+0x240/0x2a0
Jun 03 22:25:23 ol4 kernel: ? __wake_up_klogd.part.0+0x3c/0x60
Jun 03 22:25:23 ol4 kernel: ? vprintk_emit+0x104/0x270
Jun 03 22:25:23 ol4 kernel: ? wait_for_completion+0x86/0x150
Jun 03 22:25:23 ol4 kernel: wait_for_completion+0xb0/0x150
Jun 03 22:25:23 ol4 kernel: vhost_scsi_flush+0xc2/0xf0 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: vhost_scsi_clear_endpoint+0x16f/0x240 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: vhost_scsi_release+0x7d/0xf0 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: __fput+0xa2/0x270
Jun 03 22:25:23 ol4 kernel: task_work_run+0x56/0xa0
Jun 03 22:25:23 ol4 kernel: do_exit+0x337/0xb40
Jun 03 22:25:23 ol4 kernel: ? __remove_hrtimer+0x39/0x70
Jun 03 22:25:23 ol4 kernel: do_group_exit+0x30/0x90
Jun 03 22:25:23 ol4 kernel: get_signal+0x9cd/0x9f0
Jun 03 22:25:23 ol4 kernel: ? kvm_arch_vcpu_put+0x12b/0x170 [kvm]
Jun 03 22:25:23 ol4 kernel: ? vcpu_put+0x1e/0x50 [kvm]
Jun 03 22:25:23 ol4 kernel: ? kvm_arch_vcpu_ioctl_run+0x193/0x4e0 [kvm]
Jun 03 22:25:23 ol4 kernel: arch_do_signal_or_restart+0x2a/0x260
Jun 03 22:25:23 ol4 kernel: exit_to_user_mode_prepare+0xdd/0x120
Jun 03 22:25:23 ol4 kernel: syscall_exit_to_user_mode+0x1d/0x40
Jun 03 22:25:23 ol4 kernel: do_syscall_64+0x48/0x90
Jun 03 22:25:23 ol4 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 03 22:25:23 ol4 kernel: RIP: 0033:0x7f2d004df50b

The problem is that, as part of the flush, the drivers/vhost/scsi.c code will wait for outstanding commands, because we can't free the device and its resources before the commands complete, or we will hit the accessing-freed-memory bug.

We got hung because the patch had us now do:

    vhost_dev_flush() -> vhost_task_flush()

and that saw VHOST_TASK_FLAGS_STOP was set and the exited completion had completed. However, the scsi code is still waiting on commands in vhost_scsi_flush. The cmds wanted to use the vhost_task task to complete and couldn't, since the task has exited.

Handling those types of issues takes a lot more code. We would add some RCU in vhost_work_queue to handle the worker being freed from under us. Then add a callback, similar to what I did on one of the past patchsets, that stops the drivers. Then modify scsi so that, in the callback, it also sets some bits so the completion paths just do a fast fail that doesn't try to queue the completion to the vhost_task.

If we want to go that route, I can get it done in more like a 6.6 time frame.
Oleg Nesterov
2023-Jun-05 15:10 UTC
[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 06/03, michael.christie at oracle.com wrote:
> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
>
> The problem is that as part of the flush the drivers/vhost/scsi.c code
> will wait for outstanding commands, because we can't free the device and
> it's resources before the commands complete or we will hit the accessing
> freed memory bug.

Ignoring send-fd/clone issues, can we assume that the final fput/release should always come from vhost_worker's sub-thread (which shares mm/etc)?

Oleg.