Mike Christie
2023-Jun-06 20:38 UTC
[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 6/6/23 2:39 PM, Oleg Nesterov wrote:
> On 06/06, Mike Christie wrote:
>>
>> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
>>> On 06/05, Mike Christie wrote:
>>>
>>>> So it works like if we were using a kthread still:
>>>>
>>>> 1. Userspace thread0 opens /dev/vhost-$something.
>>>> 2. thread0 does the VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
>>>> create the task_struct which runs the vhost_worker() function which handles
>>>> the work->fns.
>>>> 3. If userspace now does a SIGKILL or just exits without doing a close() on
>>>> /dev/vhost-$something, then when thread0 does exit_files() that will do the
>>>> fput that does vhost-$something's file_operations->release.
>>>
>>> So, at least in this simple case vhost_worker() can just exit after SIGKILL,
>>> and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
>>> rather than wait for vhost_worker().
>>>
>>> Right?
>>
>> With the current code, the answer is no. We would hang like I mentioned here:
>>
>> https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c69c at oracle.com/
>
> If only I could fully understand this email ;)
>
> Could you spell out why this can't work (again, in this simple case)?
>
> My current (and I know, very poor) understanding is that .release() should
> roughly do the following:
>
> 1. Ensure that vhost_work_queue() can't add new callbacks.
>
> 2. Call vhost_dev_flush() to ensure that worker->work_list is empty.
>

The problem is what to do in the work->fn. What you wrote is correct for
cleaning up the work_list. However, the lower level vhost drivers, like
vhost-scsi, will do something like:

    async_submit_request_to_storage/net_layer()

from their work->fn. The submission is async, so when the request completes
it calls some callbacks that call into the vhost driver and vhost layer. For
vhost-scsi the callback will run vhost_work_queue so we can complete the
request from the vhost_task.

So if we've already run the work->fn, then we need to add code to handle the
completion of the request we submitted. We need:

1. vhost_work_queue needs some code to detect when the vhost_task has exited
so we don't do vhost_task_wake on a freed task. I was saying for this, we can
sprinkle some RCU in there and in the code paths where we clean up the
vhost_task.

2. The next problem is that if the vhost_task is going to just loop over the
work_list and kill those works before it exits (or if we do it from the
vhost_dev_flush function), then we still have to handle those async requests
that got kicked off to some other layer and are going to eventually complete
and try to call vhost_work_queue. With #1, we can detect when the vhost_task
is no longer usable, so we then need to modify the drivers to detect that and,
instead of trying to execute like normal where they queue the work, just take
their failure paths and free resources.

So the release callback was doing 2 things:

1. Flushing the work_list.
2. Waiting on those request completions.

And so, as I was saying before, I'm trying to finish up handling #2. I hit
some hiccups though, because it turns out there is at least one case where we
don't have a vhost_task but we don't want to fail. It's just a matter of
coding it though.
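[Editor's note: to make #1 above concrete, here is a rough sketch of the
"sprinkle some RCU" idea. This is not the in-tree vhost code: the helper
names vhost_work_try_queue()/vhost_worker_destroy() and the bool return
(so callers can take their failure paths when the worker is gone) are
hypothetical, and it assumes dev->worker is treated as an RCU-managed
pointer. The queueing side dereferences the worker under rcu_read_lock();
the cleanup side unpublishes the pointer and waits for readers before the
vhost_task is stopped and freed, so we never wake a freed task.]

#include <linux/rcupdate.h>
#include <linux/llist.h>
#include <linux/sched/vhost_task.h>
#include "vhost.h"	/* struct vhost_dev, vhost_worker, vhost_work */

/* Sketch only: returns false if the worker is already gone, so the
 * caller (e.g. a vhost-scsi completion path) frees its resources
 * instead of queueing work.
 */
static bool vhost_work_try_queue(struct vhost_dev *dev,
				 struct vhost_work *work)
{
	struct vhost_worker *worker;
	bool queued = false;

	rcu_read_lock();
	worker = rcu_dereference(dev->worker);
	if (worker) {
		if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
			llist_add(&work->node, &worker->work_list);
			vhost_task_wake(worker->vtsk);
		}
		queued = true;
	}
	rcu_read_unlock();
	return queued;
}

/* Cleanup side: unpublish the worker, wait for in-flight queuers,
 * then stop the vhost_task and free the worker.
 */
static void vhost_worker_destroy(struct vhost_dev *dev)
{
	struct vhost_worker *worker =
		rcu_dereference_protected(dev->worker,
					  lockdep_is_held(&dev->mutex));

	if (!worker)
		return;
	RCU_INIT_POINTER(dev->worker, NULL);
	synchronize_rcu();
	vhost_task_stop(worker->vtsk);
	kfree(worker);
}

[With something along these lines, #2 reduces to the drivers checking the
queueing result on their completion paths and taking their failure/cleanup
paths when it fails.]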
I am sad my idea for simplifying things did not work out.

Let's try an even bigger idea to reduce maintenance and simplify things.

Could vhost depend on io_uring? Could vhost just be a translation layer of
existing vhost requests to io_uring requests? At a quick glance it looks like
io_uring already supports the functionality that vhost supports (which I
think is networking and scsi).

If vhost could become a translation layer, that would allow removing the
vhost worker, and PF_USER_WORKER could be removed completely, leaving only
PF_IO_WORKER.

I suggest this because a significant vhost change is needed anyway: in the
long term the hacks in exec and coredump are not a good idea. Which means
something like my failed "[PATCH v3] fork, vhost: Use CLONE_THREAD to fix
freezer/ps regression". If we have to go to all of the trouble of reworking
things, why can't we just make io_uring do all of the work?

Eric