Jason Wang
2023-Jun-02 05:03 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On Thu, Jun 1, 2023 at 3:43?PM Oleg Nesterov <oleg at redhat.com> wrote:> > On 06/01, Jason Wang wrote: > > > > On Wed, May 31, 2023 at 5:14?PM Oleg Nesterov <oleg at redhat.com> wrote: > > > > > > > > I don't understand you. OK, to simplify, suppose we have 2 global vars > > > > > > > > > > void *PTR = something_non_null; > > > > > unsigned long FLAGS = -1ul; > > > > > > > > > > Now I think this code > > > > > > > > > > CPU_0 CPU_1 > > > > > > > > > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS)) > > > > > clear_bit(0, FLAGS); PTR = NULL; > > > > > BUG_ON(!ptr); > > > > > > > > > > is racy and can hit the BUG_ON(!ptr). > > > > > > > > This seems different to the above case? > > > > > > not sure, > > > > > > > And you can hit BUG_ON with > > > > the following execution sequence: > > > > > > > > [cpu 0] clear_bit(0, FLAGS); > > > > [cpu 1] if (!test_and_set_bit(0, FLAGS)) > > > > [cpu 1] PTR = NULL; > > > > [cpu 0] BUG_ON(!ptr) > > > > > > I don't understand this part... yes, we can hit this BUG_ON() without mb in > > > between, this is what I tried to say. > > > > I may miss something, > > Or me... note that CPU_0 loads the global "PTR" into the local "ptr" before clear_bit. > Since you have mentioned the program order: yes this lacks READ_ONCE() or barrier(), > but the same is true for the code in vhost_worker(). So I still don't understand. > > > but the above is the sequence that is executed > > by the processor (for each CPU, it's just the program order). So where > > do you expect to place an mb can help? > > before clear_bit: > > CPU_0 > > void *ptr = PTR; > mb(); // implies compiler barrier as well > clear_bit(0, FLAGS); > BUG_ON(!ptr); > > just in case... mb() in the code above is only for illustration, we can use > smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc the > one-way barrier is fine in this case.Ok, but it seems different, in the case of vhost we had a condition above the clear_bit().> > > > > > In vhost code, there's a condition before the clear_bit() which sits > > > > inside llist_for_each_entry_safe(): > > > > > > > > #define llist_for_each_entry_safe(pos, n, node, member) \ > > > > for (pos = llist_entry((node), typeof(*pos), member); \ > > > > member_address_is_nonnull(pos, member) && \ > > > > (n = llist_entry(pos->member.next, typeof(*n), member), true); \ > > > > pos = n) > > > > > > > > The clear_bit() is a store which is not speculated, so there's a > > > > control dependency, the store can't be executed until the condition > > > > expression is evaluated which requires pos->member.next > > > > (work->node.next) to be loaded. > > > > > > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have > > > something like > > > > > > n = llist_entry(...); > > > if (n) > > > clear_bit(...); > > > > > > so I do not see how can we rely on the load-store control dependency. > > > > Just to make sure we are on the same page, the condition expression is > > > > member_address_is_nonnull(pos, member) && (n > > llist_entry(pos->member.next, typeof(*n), member), true) > > > > So it's something like: > > > > if (work->node && (work_next = work->node->next, true)) > > clear_bit(&work->flags); > > > > So two loads from both work->node and work->node->next, and there's a > > store which is clear_bit, then it's a load-store control dependencies? > > I guess you missed the comma expression...Probably not, see below:> Let me rewrite your pseudo-code > above, it is equivalent to > > if (work->node) { > if ((work_next = work->node->next, true)) > clear_bit(&work->flags); > } > > another rewrite: > > if (work->node) { > work_next = work->node->next; > if ((work, true)) > clear_bit(&work->flags); > } > > and the final rewrite: > > if (work->node) { > work_next = work->node->next; > if (true) > clear_bit(&work->flags); > } > > so again, I do not see the load-store control dependency.This kind of optimization is suspicious. Especially considering it's the control expression of the loop but not a condition. Looking at the assembly (x86): 0xffffffff81d46c5b <+75>: callq 0xffffffff81689ac0 <llist_reverse_order> 0xffffffff81d46c60 <+80>: mov %rax,%r15 0xffffffff81d46c63 <+83>: test %rax,%rax 0xffffffff81d46c66 <+86>: je 0xffffffff81d46c3a <vhost_worker+42> 0xffffffff81d46c68 <+88>: mov %r15,%rdi 0xffffffff81d46c6b <+91>: mov (%r15),%r15 0xffffffff81d46c6e <+94>: lock andb $0xfd,0x10(%rdi) 0xffffffff81d46c73 <+99>: movl $0x0,0x18(%rbx) 0xffffffff81d46c7a <+106>: mov 0x8(%rdi),%rax 0xffffffff81d46c7e <+110>: callq 0xffffffff821b39a0 <__x86_indirect_thunk_array> 0xffffffff81d46c83 <+115>: callq 0xffffffff821b4d10 <__SCT__cond_resched> ... I can see: 1) The code read node->next (+91) before clear_bit (+94) 2) And the it uses a lock prefix to guarantee the execution order> Not to mention this > code lacks READ_ONCE().Work_next is loaded only once for temporary storage, so I don't see why we need that. Thanks> > > If we had something like > > if (work->node) { > work_next = READ_ONCE(work->node->next); > if (work_next) > clear_bit(&work->flags); > } > > instead, then yes, we could rely on the LOAD-STORE dependency. > > Oleg. >
Oleg Nesterov
2023-Jun-02 17:58 UTC
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 06/02, Jason Wang wrote:> > On Thu, Jun 1, 2023 at 3:43?PM Oleg Nesterov <oleg at redhat.com> wrote: > > > > and the final rewrite: > > > > if (work->node) { > > work_next = work->node->next; > > if (true) > > clear_bit(&work->flags); > > } > > > > so again, I do not see the load-store control dependency. > > This kind of optimization is suspicious. Especially considering it's > the control expression of the loop but not a condition.It is not about optimization,> Looking at the assembly (x86): > > 0xffffffff81d46c5b <+75>: callq 0xffffffff81689ac0 <llist_reverse_order> > 0xffffffff81d46c60 <+80>: mov %rax,%r15 > 0xffffffff81d46c63 <+83>: test %rax,%rax > 0xffffffff81d46c66 <+86>: je 0xffffffff81d46c3a <vhost_worker+42> > 0xffffffff81d46c68 <+88>: mov %r15,%rdi > 0xffffffff81d46c6b <+91>: mov (%r15),%r15 > 0xffffffff81d46c6e <+94>: lock andb $0xfd,0x10(%rdi) > 0xffffffff81d46c73 <+99>: movl $0x0,0x18(%rbx) > 0xffffffff81d46c7a <+106>: mov 0x8(%rdi),%rax > 0xffffffff81d46c7e <+110>: callq 0xffffffff821b39a0 > <__x86_indirect_thunk_array> > 0xffffffff81d46c83 <+115>: callq 0xffffffff821b4d10 <__SCT__cond_resched> > ... > > I can see: > > 1) The code read node->next (+91) before clear_bit (+94)The code does. but what about CPU ?> 2) And the it uses a lock prefix to guarantee the execution orderAs I said from the very beginning, this code is fine on x86 because atomic ops are fully serialised on x86. OK. we can't convince each other. I'll try to write another email when I have time, If this code is correct, then my understanding of memory barriers is even worse than I think. I wouldn't be surprised, but I'd like to understand what I have missed. Oleg.