Eric W. Biederman
2023-May-23 15:57 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
Oleg Nesterov <oleg at redhat.com> writes:

> On 05/22, Oleg Nesterov wrote:
>>
>> Right now I think that "int dead" should die,
>
> No, probably we shouldn't call get_signal() if we have already
> dequeued SIGKILL.

Very much agreed.  It is one thing to add a patch to move do_exit
out of get_signal.  It is another to keep calling get_signal after
that.  Nothing tests that case, and so we get some weird behaviors.

>> but let me think tomorrow.
>
> Maybe something like this... I don't like it but I can't suggest
> anything better right now.
>
> 	bool killed = false;
>
> 	for (;;) {
> 		...
>
> 		node = llist_del_all(&worker->work_list);
> 		if (!node) {
> 			schedule();
> 			/*
> 			 * When we get a SIGKILL our release function will
> 			 * be called. That will stop new IOs from being queued
> 			 * and check for outstanding cmd responses. It will then
> 			 * call vhost_task_stop to tell us to return and exit.
> 			 */
> 			if (signal_pending(current)) {
> 				struct ksignal ksig;
>
> 				if (!killed)
> 					killed = get_signal(&ksig);
>
> 				clear_thread_flag(TIF_SIGPENDING);
> 			}
>
> 			continue;
> 		}

I want to point out that we need to consider not just SIGKILL, but
SIGABRT that causes a coredump, as well as the process performing an
ordinary exit(2), all of which will cause get_signal to return SIGKILL
in this context.

> -------------------------------------------------------------------------------
> But let me ask a couple of questions.

I share most of these questions.

> Let's forget this patch, let's look at the current code:
>
> 	node = llist_del_all(&worker->work_list);
> 	if (!node)
> 		schedule();
>
> 	node = llist_reverse_order(node);
> 	... process works ...
>
> To me this looks a bit confusing. Shouldn't we do
>
> 	if (!node) {
> 		schedule();
> 		continue;
> 	}
>
> just to make the code a bit more clear? If node == NULL then
> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
> But this is minor.
>
>
> 	/* make sure flag is seen after deletion */
> 	smp_wmb();
> 	llist_for_each_entry_safe(work, work_next, node, node) {
> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> vhost_work_queue() can add this work again and change work->node->next.
>
> That is why we use _safe, but we need to ensure that llist_for_each_safe()
> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>
> So it seems that smp_wmb() can't help and should be removed, instead we need
>
> 	llist_for_each_entry_safe(...) {
> 		smp_mb__before_atomic();
> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> Also, if the work->fn pointer is not stable, we should read it before
> smp_mb__before_atomic() as well.
>
> No?
>
>
> 	__set_current_state(TASK_RUNNING);
>
> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> can return with current->state != RUNNING ?
>
>
> 	work->fn(work);
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?
>
>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?

In a conversation long ago I remember hearing that vhost does not
support file descriptor passing, which means all of the file
descriptors should be in the same process.
Looking at the vhost code, what I see happening is that the vhost_worker
persists until vhost_dev_cleanup is called from one of the
vhost_???_release() functions.  The release functions are only called
after the last flush function completes.  See __fput if you want to
trace the details.

On one hand this all seems reasonable.  On the other hand I am not
seeing the code that prevents file descriptor passing.

It is probably not the worst thing in the world, but what this means is
that now, if you pass a copy of the vhost file descriptor to another
process, the vhost_worker will persist, and thus the process will
persist, until that copy of the file descriptor is closed.

Eric
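[For reference, a minimal, untested sketch of what the inner vhost_worker() loop would look like with the two changes Oleg suggests above: the early continue when the work list is empty, and reading work->fn plus issuing smp_mb__before_atomic() before clear_bit() in place of the smp_wmb() before the loop. The rest of the real loop's housekeeping is elided, and this only illustrates the ordering argument; it is not a proposed patch.]

	for (;;) {
		...
		node = llist_del_all(&worker->work_list);
		if (!node) {
			schedule();
			continue;
		}

		node = llist_reverse_order(node);
		llist_for_each_entry_safe(work, work_next, node, node) {
			/* work_next (i.e. work->node.next) has already been
			 * loaded by the _safe macro at this point; read fn
			 * here too, while we still own the work item. */
			vhost_work_fn_t fn = work->fn;

			/* Make sure the loads above complete before the
			 * clear_bit() below becomes visible: once the bit is
			 * clear, vhost_work_queue() may requeue the work and
			 * rewrite both fields.  smp_wmb() only orders stores,
			 * which is why it is dropped here. */
			smp_mb__before_atomic();
			clear_bit(VHOST_WORK_QUEUED, &work->flags);

			__set_current_state(TASK_RUNNING);
			fn(work);
		}
	}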
Oleg Nesterov
2023-May-24 14:10 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 05/23, Eric W. Biederman wrote:
>
> I want to point out that we need to consider not just SIGKILL, but
> SIGABRT that causes a coredump, as well as the process performing an
> ordinary exit(2), all of which will cause get_signal to return SIGKILL
> in this context.

Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt
vhost_worker().

> It is probably not the worst thing in the world, but what this means is
> that now, if you pass a copy of the vhost file descriptor to another
> process, the vhost_worker will persist, and thus the process will
> persist, until that copy of the file descriptor is closed.

Hadn't thought about it.

I am fighting with internal bugzillas today, will try to write another
email tomorrow.

But before that, I would like to have an answer to my "main" question in
my previous email.  Otherwise I am still not sure I understand what
exactly we need to fix.

Oleg.
Jason Wang
2023-May-31 05:22 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 2023/5/23 23:57, Eric W. Biederman wrote:
> Oleg Nesterov <oleg at redhat.com> writes:
>
>> On 05/22, Oleg Nesterov wrote:
>>> Right now I think that "int dead" should die,
>> No, probably we shouldn't call get_signal() if we have already
>> dequeued SIGKILL.
> Very much agreed.  It is one thing to add a patch to move do_exit
> out of get_signal.  It is another to keep calling get_signal after
> that.  Nothing tests that case, and so we get some weird behaviors.
>
>
>>> but let me think tomorrow.
>> Maybe something like this... I don't like it but I can't suggest
>> anything better right now.
>>
>> 	bool killed = false;
>>
>> 	for (;;) {
>> 		...
>>
>> 		node = llist_del_all(&worker->work_list);
>> 		if (!node) {
>> 			schedule();
>> 			/*
>> 			 * When we get a SIGKILL our release function will
>> 			 * be called. That will stop new IOs from being queued
>> 			 * and check for outstanding cmd responses. It will then
>> 			 * call vhost_task_stop to tell us to return and exit.
>> 			 */
>> 			if (signal_pending(current)) {
>> 				struct ksignal ksig;
>>
>> 				if (!killed)
>> 					killed = get_signal(&ksig);
>>
>> 				clear_thread_flag(TIF_SIGPENDING);
>> 			}
>>
>> 			continue;
>> 		}
> I want to point out that we need to consider not just SIGKILL, but
> SIGABRT that causes a coredump, as well as the process performing an
> ordinary exit(2), all of which will cause get_signal to return SIGKILL
> in this context.
>
>> -------------------------------------------------------------------------------
>> But let me ask a couple of questions.
> I share most of these questions.
>
>> Let's forget this patch, let's look at the current code:
>>
>> 	node = llist_del_all(&worker->work_list);
>> 	if (!node)
>> 		schedule();
>>
>> 	node = llist_reverse_order(node);
>> 	... process works ...
>>
>> To me this looks a bit confusing. Shouldn't we do
>>
>> 	if (!node) {
>> 		schedule();
>> 		continue;
>> 	}
>>
>> just to make the code a bit more clear? If node == NULL then
>> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
>> But this is minor.
>>
>>
>> 	/* make sure flag is seen after deletion */
>> 	smp_wmb();
>> 	llist_for_each_entry_safe(work, work_next, node, node) {
>> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>>
>> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
>> vhost_work_queue() can add this work again and change work->node->next.
>>
>> That is why we use _safe, but we need to ensure that llist_for_each_safe()
>> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>>
>> So it seems that smp_wmb() can't help and should be removed, instead we need
>>
>> 	llist_for_each_entry_safe(...) {
>> 		smp_mb__before_atomic();
>> 		clear_bit(VHOST_WORK_QUEUED, &work->flags);
>>
>> Also, if the work->fn pointer is not stable, we should read it before
>> smp_mb__before_atomic() as well.
>>
>> No?
>>
>>
>> 	__set_current_state(TASK_RUNNING);
>>
>> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
>> can return with current->state != RUNNING ?
>>
>>
>> 	work->fn(work);
>>
>> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
>> before we call work->fn(). Is it "safe" to run this callback with
>> signal_pending() or fatal_signal_pending() ?
>>
>>
>> Finally. I never looked into drivers/vhost/ before so I don't understand
>> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
>> to run the pending callbacks rather than wait for vhost_worker() ?
>> I guess we can't, ->mm won't be correct, but can you confirm?
> In a conversation long ago I remember hearing that vhost does not
> support file descriptor passing, which means all of the file
> descriptors should be in the same process.

It's not the case.  Passing a vhost fd is actually pretty common, since
Qemu usually runs without privileges.  It is the management layer's job
to open the vhost fd and pass it to Qemu.

>
> Looking at the vhost code, what I see happening is that the vhost_worker
> persists until vhost_dev_cleanup is called from one of the
> vhost_???_release() functions.  The release functions are only called
> after the last flush function completes.  See __fput if you want to
> trace the details.
>
>
> On one hand this all seems reasonable.  On the other hand I am not
> seeing the code that prevents file descriptor passing.

Yes.

>
>
> It is probably not the worst thing in the world, but what this means is
> that now, if you pass a copy of the vhost file descriptor to another
> process, the vhost_worker will persist, and thus the process will
> persist, until that copy of the file descriptor is closed.

Right.

Thanks

>
> Eric
>
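[To make the scenario Jason describes concrete, a hedged userspace sketch, not taken from this thread: a privileged management process opens the vhost device and hands the descriptor to an unprivileged Qemu over a unix-domain socket with SCM_RIGHTS. Function and variable names are illustrative, and error handling is simplified. The receiving process then holds its own reference to the same struct file, so, per the discussion above, the vhost_worker will not go away until every such copy of the descriptor is closed.]

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* Send an already-open vhost fd (e.g. from /dev/vhost-net) to another
	 * process.  "unix_sock" is assumed to be a connected AF_UNIX socket. */
	static int send_vhost_fd(int unix_sock, int vhost_fd)
	{
		char byte = 0;
		struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
		union {
			struct cmsghdr align;
			char buf[CMSG_SPACE(sizeof(int))];
		} u;
		struct msghdr msg = {
			.msg_iov        = &iov,
			.msg_iovlen     = 1,
			.msg_control    = u.buf,
			.msg_controllen = sizeof(u.buf),
		};
		struct cmsghdr *cmsg;

		memset(&u, 0, sizeof(u));
		cmsg = CMSG_FIRSTHDR(&msg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type  = SCM_RIGHTS;
		cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &vhost_fd, sizeof(int));

		/* After this, the receiver owns an independent reference to the
		 * same struct file as vhost_fd. */
		return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
	}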