thr3ads.net - Linux Virtualization - [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression [Jun 2023]

If this information is useful, please help other people find it:
Share via:

Oleg Nesterov

2023-Jun-06 19:39 UTC

[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/06, Mike Christie wrote:>
> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
> > On 06/05, Mike Christie wrote:
> >
> >> So it works like if we were using a kthread still:
> >>
> >> 1. Userapce thread0 opens /dev/vhost-$something.
> >> 2. thread0 does VHOST_SET_OWNER ioctl. This calls
vhost_task_create() to
> >> create the task_struct which runs the vhost_worker() function
which handles
> >> the work->fns.
> >> 3. If userspace now does a SIGKILL or just exits without doing a
close() on
> >> /dev/vhost-$something, then when thread0 does exit_files() that
will do the
> >> fput that does vhost-$something's file_operations->release.
> >
> > So, at least in this simple case vhost_worker() can just exit after
SIGKILL,
> > and thread0 can flush the outstanding commands when it calls
vhost_dev_flush()
> > rather than wait for vhost_worker().
> >
> > Right?
>
> With the current code, the answer is no. We would hang like I mentioned
here:
>
> https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c69c at
oracle.com/
If only I could fully understand this email ;)

Could you spell to explain why this can't work (again, in this simple case)
?

My current (and I know, very poor) understanding is that .release() should
roughly do the following:

	1. Ensure that vhost_work_queue() can't add the new callbacks

	2. Call vhost_dev_flush() to ensure that worker->work_list is empty

	3. Call vhost_task_stop()

so why this sequence can't work if we turn vhost_dev_flush() into something
like

	void vhost_dev_flush(struct vhost_dev *dev)
	{
		struct vhost_flush_struct flush;

		if (dev->worker) {
			// this assumes that vhost_task_create() uses CLONE_THREAD
			if (same_thread_group(current, dev->worker->vtsk->task)) {
				... run the pending callbacks ...
				return;
			}


			// this is what we currently have

			init_completion(&flush.wait_event);
			vhost_work_init(&flush.work, vhost_flush_work);

			vhost_work_queue(dev, &flush.work);
			wait_for_completion(&flush.wait_event);
		}
	}

?

Mike, I am just trying to understand what exactly vhost_worker() should do.
> We need to add code like I mentioned in that reply because we don't
have a
> way to call into the layers below us to flush those commands.
This tells me nothing, but this is my fault, not yours. Again, again, I know
nothing about drivers/vhost.

Oleg.

Mike Christie

2023-Jun-06 20:38 UTC

head link

[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 6/6/23 2:39 PM, Oleg Nesterov wrote:> On 06/06, Mike Christie wrote:
>>
>> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
>>> On 06/05, Mike Christie wrote:
>>>
>>>> So it works like if we were using a kthread still:
>>>>
>>>> 1. Userapce thread0 opens /dev/vhost-$something.
>>>> 2. thread0 does VHOST_SET_OWNER ioctl. This calls
vhost_task_create() to
>>>> create the task_struct which runs the vhost_worker() function
which handles
>>>> the work->fns.
>>>> 3. If userspace now does a SIGKILL or just exits without doing
a close() on
>>>> /dev/vhost-$something, then when thread0 does exit_files() that
will do the
>>>> fput that does vhost-$something's
file_operations->release.
>>>
>>> So, at least in this simple case vhost_worker() can just exit after
SIGKILL,
>>> and thread0 can flush the outstanding commands when it calls
vhost_dev_flush()
>>> rather than wait for vhost_worker().
>>>
>>> Right?
>>
>> With the current code, the answer is no. We would hang like I mentioned
here:
>>
>> https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c69c at
oracle.com/
> 
> If only I could fully understand this email ;)
> 
> Could you spell to explain why this can't work (again, in this simple
case) ?
> 
> My current (and I know, very poor) understanding is that .release() should
> roughly do the following:
> 
> 	1. Ensure that vhost_work_queue() can't add the new callbacks
> 
> 	2. Call vhost_dev_flush() to ensure that worker->work_list is empty
> 
The problem is what do we do in the work->fn.

What you wrote is correct for cleaning up the work_list. However, the lower
level
vhost drivers, like vhost-scsi, will do something like:

async_submit_request_to_storage/net_layer()

from their work->fn. The submission is async so when the request completes it
calls some callbacks that call into the vhost driver and vhost layer. For
vhost-scsi the call back will run vhost_queue_work so we can complete the
request
from the vhost_task.

So if we've already run the work->fn then we need to add code to handle
the
completion of the request we submitted. We need:

1. vhost_queue_work needs some code to detect when the vhost_task has exited
so we don't do vhost_task_wake on a freed task.

I was saying for this, we can sprinkle some RCU in there and in the code paths
we cleanup the vhost_task.

2. The next problem is that if the vhost_task is going to just loop over the
work_list and kill those works before it exits (or if we do it from the
vhost_dev_flush
function), then we still have handle those async requests that got kicked off to
some other layer that are going to eventually complete and try to call
vhost_work_queue.

With #1, we can detect when the vhost_task is no longer usable, so we then need
to modify the drivers to detect that and instead of trying to execute like
normal
where they queue the work, they just take their failure paths and free
resources.

So the release cabllback was doing 2 things:
1. Flushing the work_list
2. Waiting on the those request completions

And so I was saying before I'm trying to finish up handling #2. I hit some
hiccups though because it turns out there is at least one case where we
don't have a vhost_task but we don't want to fail. It's just a
matter of
coding it though.

Eric W. Biederman

2023-Jun-11 20:27 UTC

head link

[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Oleg Nesterov <oleg at redhat.com> writes:
> On 06/06, Mike Christie wrote:
>>
>> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
>> > On 06/05, Mike Christie wrote:
>> >
>> >> So it works like if we were using a kthread still:
>> >>
>> >> 1. Userapce thread0 opens /dev/vhost-$something.
>> >> 2. thread0 does VHOST_SET_OWNER ioctl. This calls
vhost_task_create() to
>> >> create the task_struct which runs the vhost_worker() function
which handles
>> >> the work->fns.
>> >> 3. If userspace now does a SIGKILL or just exits without doing
a close() on
>> >> /dev/vhost-$something, then when thread0 does exit_files()
that will do the
>> >> fput that does vhost-$something's
file_operations->release.
>> >
>> > So, at least in this simple case vhost_worker() can just exit
after SIGKILL,
>> > and thread0 can flush the outstanding commands when it calls
vhost_dev_flush()
>> > rather than wait for vhost_worker().
>> >
>> > Right?
>>
>> With the current code, the answer is no. We would hang like I mentioned
here:
>>
>> https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c69c at
oracle.com/
>
> If only I could fully understand this email ;)
>
> Could you spell to explain why this can't work (again, in this simple
case) ?
>
> My current (and I know, very poor) understanding is that .release() should
> roughly do the following:
>
> 	1. Ensure that vhost_work_queue() can't add the new callbacks
>
> 	2. Call vhost_dev_flush() to ensure that worker->work_list is empty
>
> 	3. Call vhost_task_stop()

At least in the case of exec by the time the final fput happens
from close_on_exec the task has already changed it's mm.  So the
conditions are wrong to run the work queue items.

For close(2) and SIGKILL perhaps, but definitely not in the case of
exec.


Eric

Linux Virtualization - Jun 2023 - [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

[CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression