Oleg Nesterov
2023-Jun-02 17:58 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 06/02, Jason Wang wrote:
>
> On Thu, Jun 1, 2023 at 3:43 PM Oleg Nesterov <oleg at redhat.com> wrote:
> >
> > and the final rewrite:
> >
> >	if (work->node) {
> >		work_next = work->node->next;
> >		if (true)
> >			clear_bit(&work->flags);
> >	}
> >
> > so again, I do not see the load-store control dependency.
>
> This kind of optimization is suspicious. Especially considering it's
> the control expression of the loop but not a condition.

It is not about optimization,

> Looking at the assembly (x86):
>
> 0xffffffff81d46c5b <+75>:	callq  0xffffffff81689ac0 <llist_reverse_order>
> 0xffffffff81d46c60 <+80>:	mov    %rax,%r15
> 0xffffffff81d46c63 <+83>:	test   %rax,%rax
> 0xffffffff81d46c66 <+86>:	je     0xffffffff81d46c3a <vhost_worker+42>
> 0xffffffff81d46c68 <+88>:	mov    %r15,%rdi
> 0xffffffff81d46c6b <+91>:	mov    (%r15),%r15
> 0xffffffff81d46c6e <+94>:	lock andb $0xfd,0x10(%rdi)
> 0xffffffff81d46c73 <+99>:	movl   $0x0,0x18(%rbx)
> 0xffffffff81d46c7a <+106>:	mov    0x8(%rdi),%rax
> 0xffffffff81d46c7e <+110>:	callq  0xffffffff821b39a0 <__x86_indirect_thunk_array>
> 0xffffffff81d46c83 <+115>:	callq  0xffffffff821b4d10 <__SCT__cond_resched>
> ...
>
> I can see:
>
> 1) The code reads node->next (+91) before clear_bit (+94)

The code does. But what about the CPU?

> 2) And then it uses a lock prefix to guarantee the execution order

As I said from the very beginning, this code is fine on x86 because
atomic ops are fully serialised on x86.

OK, we can't convince each other. I'll try to write another email when
I have time. If this code is correct, then my understanding of memory
barriers is even worse than I think. I wouldn't be surprised, but I'd
like to understand what I have missed.

Oleg.
Linus Torvalds
2023-Jun-02 20:07 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On Fri, Jun 2, 2023 at 1:59 PM Oleg Nesterov <oleg at redhat.com> wrote:
>
> As I said from the very beginning, this code is fine on x86 because
> atomic ops are fully serialised on x86.

Yes. Other architectures require __smp_mb__{before,after}_atomic for
the bit setting ops to actually be memory barriers.

We *should* probably have acquire/release versions of the bit
test/set helpers, but we don't, so they end up being full memory
barriers with those things. Which isn't optimal, but I doubt it
matters on most architectures.

So maybe we'll some day have a "test_bit_acquire()" and a
"set_bit_release()" etc.

               Linus