thr3ads.net - Virtualization - [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression [Jun 2023]

If this information is useful, please help other people find it:
Share via:

Jason Wang

2023-Jun-02 05:03 UTC

[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Thu, Jun 1, 2023 at 3:43?PM Oleg Nesterov <oleg at redhat.com>
wrote:>
> On 06/01, Jason Wang wrote:
> >
> > On Wed, May 31, 2023 at 5:14?PM Oleg Nesterov <oleg at
redhat.com> wrote:
> > >
> > > > > I don't understand you. OK, to simplify, suppose we
have 2 global vars
> > > > >
> > > > >         void *PTR = something_non_null;
> > > > >         unsigned long FLAGS = -1ul;
> > > > >
> > > > > Now I think this code
> > > > >
> > > > >         CPU_0                           CPU_1
> > > > >
> > > > >         void *ptr = PTR;                if
(!test_and_set_bit(0, FLAGS))
> > > > >         clear_bit(0, FLAGS);                    PTR =
NULL;
> > > > >         BUG_ON(!ptr);
> > > > >
> > > > > is racy and can hit the BUG_ON(!ptr).
> > > >
> > > > This seems different to the above case?
> > >
> > > not sure,
> > >
> > > > And you can hit BUG_ON with
> > > > the following execution sequence:
> > > >
> > > > [cpu 0] clear_bit(0, FLAGS);
> > > > [cpu 1] if (!test_and_set_bit(0, FLAGS))
> > > > [cpu 1] PTR = NULL;
> > > > [cpu 0] BUG_ON(!ptr)
> > >
> > > I don't understand this part... yes, we can hit this BUG_ON()
without mb in
> > > between, this is what I tried to say.
> >
> > I may miss something,
>
> Or me... note that CPU_0 loads the global "PTR" into the local
"ptr" before clear_bit.
> Since you have mentioned the program order: yes this lacks READ_ONCE() or
barrier(),
> but the same is true for the code in vhost_worker(). So I still don't
understand.
>
> > but the above is the sequence that is executed
> > by the processor (for each CPU, it's just the program order). So
where
> > do you expect to place an mb can help?
>
> before clear_bit:
>
>         CPU_0
>
>         void *ptr = PTR;
>         mb();                   // implies compiler barrier as well
>         clear_bit(0, FLAGS);
>         BUG_ON(!ptr);
>
> just in case... mb() in the code above is only for illustration, we can use
> smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc the
> one-way barrier is fine in this case.
Ok, but it seems different, in the case of vhost we had a condition
above the clear_bit().
>
>
> > > > In vhost code, there's a condition before the
clear_bit() which sits
> > > > inside llist_for_each_entry_safe():
> > > >
> > > > #define llist_for_each_entry_safe(pos, n, node, member)     
\
> > > >         for (pos = llist_entry((node), typeof(*pos),
member);                  \
> > > >              member_address_is_nonnull(pos, member)
&&                         \
> > > >                 (n = llist_entry(pos->member.next,
typeof(*n), member), true); \
> > > >              pos = n)
> > > >
> > > > The clear_bit() is a store which is not speculated, so
there's a
> > > > control dependency, the store can't be executed until
the condition
> > > > expression is evaluated which requires pos->member.next
> > > > (work->node.next) to be loaded.
> > >
> > > But llist_for_each_entry_safe() doesn't check "n",
I mean, it is not that we have
> > > something like
> > >
> > >         n = llist_entry(...);
> > >         if (n)
> > >                 clear_bit(...);
> > >
> > > so I do not see how can we rely on the load-store control
dependency.
> >
> > Just to make sure we are on the same page, the condition expression is
> >
> > member_address_is_nonnull(pos, member) && (n > >
llist_entry(pos->member.next, typeof(*n), member), true)
> >
> > So it's something like:
> >
> > if (work->node && (work_next = work->node->next,
true))
> >     clear_bit(&work->flags);
> >
> > So two loads from both work->node and work->node->next, and
there's a
> > store which is clear_bit, then it's a load-store control
dependencies?
>
> I guess you missed the comma expression...
Probably not, see below:
> Let me rewrite your pseudo-code
> above, it is equivalent to
>
>         if (work->node) {
>                 if ((work_next = work->node->next, true))
>                         clear_bit(&work->flags);
>         }
>
> another rewrite:
>
>         if (work->node) {
>                 work_next = work->node->next;
>                 if ((work, true))
>                         clear_bit(&work->flags);
>         }
>
> and the final rewrite:
>
>         if (work->node) {
>                 work_next = work->node->next;
>                 if (true)
>                         clear_bit(&work->flags);
>         }
>
> so again, I do not see the load-store control dependency.
This kind of optimization is suspicious. Especially considering it's
the control expression of the loop but not a condition.

Looking at the assembly (x86):

   0xffffffff81d46c5b <+75>:    callq  0xffffffff81689ac0
<llist_reverse_order>
   0xffffffff81d46c60 <+80>:    mov    %rax,%r15
   0xffffffff81d46c63 <+83>:    test   %rax,%rax
   0xffffffff81d46c66 <+86>:    je     0xffffffff81d46c3a
<vhost_worker+42>
   0xffffffff81d46c68 <+88>:    mov    %r15,%rdi
   0xffffffff81d46c6b <+91>:    mov    (%r15),%r15
   0xffffffff81d46c6e <+94>:    lock andb $0xfd,0x10(%rdi)
   0xffffffff81d46c73 <+99>:    movl   $0x0,0x18(%rbx)
   0xffffffff81d46c7a <+106>:   mov    0x8(%rdi),%rax
   0xffffffff81d46c7e <+110>:   callq  0xffffffff821b39a0
<__x86_indirect_thunk_array>
   0xffffffff81d46c83 <+115>:   callq  0xffffffff821b4d10
<__SCT__cond_resched>
...

I can see:

1) The code read node->next (+91) before clear_bit (+94)
2) And the it uses a lock prefix to guarantee the execution order
> Not to mention this
> code lacks READ_ONCE().
Work_next is loaded only once for temporary storage, so I don't see
why we need that.

Thanks


>
>
> If we had something like
>
>         if (work->node) {
>                 work_next = READ_ONCE(work->node->next);
>                 if (work_next)
>                         clear_bit(&work->flags);
>         }
>
> instead, then yes, we could rely on the LOAD-STORE dependency.
>
> Oleg.
>

Oleg Nesterov

2023-Jun-02 17:58 UTC

head link

[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/02, Jason Wang wrote:>
> On Thu, Jun 1, 2023 at 3:43?PM Oleg Nesterov <oleg at redhat.com>
wrote:
> >
> > and the final rewrite:
> >
> >         if (work->node) {
> >                 work_next = work->node->next;
> >                 if (true)
> >                         clear_bit(&work->flags);
> >         }
> >
> > so again, I do not see the load-store control dependency.
>
> This kind of optimization is suspicious. Especially considering it's
> the control expression of the loop but not a condition.
It is not about optimization,
> Looking at the assembly (x86):
>
>    0xffffffff81d46c5b <+75>:    callq  0xffffffff81689ac0
<llist_reverse_order>
>    0xffffffff81d46c60 <+80>:    mov    %rax,%r15
>    0xffffffff81d46c63 <+83>:    test   %rax,%rax
>    0xffffffff81d46c66 <+86>:    je     0xffffffff81d46c3a
<vhost_worker+42>
>    0xffffffff81d46c68 <+88>:    mov    %r15,%rdi
>    0xffffffff81d46c6b <+91>:    mov    (%r15),%r15
>    0xffffffff81d46c6e <+94>:    lock andb $0xfd,0x10(%rdi)
>    0xffffffff81d46c73 <+99>:    movl   $0x0,0x18(%rbx)
>    0xffffffff81d46c7a <+106>:   mov    0x8(%rdi),%rax
>    0xffffffff81d46c7e <+110>:   callq  0xffffffff821b39a0
> <__x86_indirect_thunk_array>
>    0xffffffff81d46c83 <+115>:   callq  0xffffffff821b4d10
<__SCT__cond_resched>
> ...
>
> I can see:
>
> 1) The code read node->next (+91) before clear_bit (+94)
The code does. but what about CPU ?
> 2) And the it uses a lock prefix to guarantee the execution order
As I said from the very beginning, this code is fine on x86 because
atomic ops are fully serialised on x86.

OK. we can't convince each other. I'll try to write another email when
I have time,

If this code is correct, then my understanding of memory barriers is even
worse than I think. I wouldn't be surprised, but I'd like to understand
what I have missed.

Oleg.

Virtualization - Jun 2023 - [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression