thr3ads.net - Ocfs2 devel - [Ocfs2-devel] Race condition between OCFS2 downconvert thread and ocfs2 cluster lock. [Feb 2012]

xiaowei.hu at oracle.com

2012-Feb-21 06:12 UTC

[Ocfs2-devel] Race condition between OCFS2 downconvert thread and ocfs2 cluster lock.

I am trying to fix bug 13611997. CT's machine ran into a BUG in the ocfs2dc
thread:
BUG_ON(lockres->l_action != OCFS2_AST_CONVERT && lockres->l_action
!= OCFS2_AST_DOWNCONVERT);
I analyzed the vmcore: lockres->l_action = OCFS2_AST_ATTACH and l_flags = 326
(0x146, which means
OCFS2_LOCK_BUSY|OCFS2_LOCK_BLOCKED|OCFS2_LOCK_INITIALIZED|OCFS2_LOCK_QUEUED).
Comparing this with the code, this state is only possible during
ocfs2_cluster_lock. Here is the race:

NodeA								NodeB
ocfs2_cluster_lock on a new lockres M
spin_lock_irqsave(&lockres->l_lock, flags);
gen = lockres_set_pending(lockres);
lockres->l_action = OCFS2_AST_ATTACH;
lockres_or_flags(lockres, OCFS2_LOCK_BUSY);
spin_unlock_irqrestore(&lockres->l_lock, flags);

ocfs2_dlm_lock() finished and returned.
**and lockres_clear_pending(lockres, gen, osb);
							requests a lock on the same lockres M
							It is blocked by NodeA, and an ast proxy was sent to A

bast queued and flushed before the ast was queued;
then ocfs2dc was scheduled.
There is a chance to execute this code path:
ocfs2_downconvert_thread()
ocfs2_downconvert_thread_do_work()
ocfs2_blocking_ast()
ocfs2_process_blocked_lock()
ocfs2_unblock_lock()
	spin_lock_irqsave(&lockres->l_lock, flags);
	if (lockres->l_flags & OCFS2_LOCK_BUSY)
	    ret = ocfs2_prepare_cancel_convert(osb, lockres);
		BUG_ON(lockres->l_action != OCFS2_AST_CONVERT &&
		       lockres->l_action != OCFS2_AST_DOWNCONVERT);
		the BUG() triggers here

Solution:
One possible solution is to remove the lockres_clear_pending call marked with
two stars and leave the clearing to the ast function. This makes sure the bast
path waits for the ast, which clears OCFS2_LOCK_BUSY and sets
OCFS2_LOCK_ATTACHED first, before the downconvert process is entered.
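
The l_flags value reported from the vmcore can be cross-checked with a small user-space decoder. This is only a sketch: the bit values are assumed to match the OCFS2_LOCK_* definitions in mainline fs/ocfs2/ocfs2.h and may differ in the 1.4 tree discussed here, and decode_l_flags() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed values, modelled on mainline fs/ocfs2/ocfs2.h; verify against your tree. */
#define OCFS2_LOCK_ATTACHED      0x00000001UL
#define OCFS2_LOCK_BUSY          0x00000002UL
#define OCFS2_LOCK_BLOCKED       0x00000004UL
#define OCFS2_LOCK_LOCAL         0x00000008UL
#define OCFS2_LOCK_NEEDS_REFRESH 0x00000010UL
#define OCFS2_LOCK_REFRESHING    0x00000020UL
#define OCFS2_LOCK_INITIALIZED   0x00000040UL
#define OCFS2_LOCK_FREEING       0x00000080UL
#define OCFS2_LOCK_QUEUED        0x00000100UL
#define OCFS2_LOCK_NOCACHE       0x00000200UL
#define OCFS2_LOCK_PENDING       0x00000400UL

struct flag_name {
	unsigned long bit;
	const char *name;
};

#define FLAG(f) { f, #f }
static const struct flag_name flag_names[] = {
	FLAG(OCFS2_LOCK_ATTACHED), FLAG(OCFS2_LOCK_BUSY),
	FLAG(OCFS2_LOCK_BLOCKED), FLAG(OCFS2_LOCK_LOCAL),
	FLAG(OCFS2_LOCK_NEEDS_REFRESH), FLAG(OCFS2_LOCK_REFRESHING),
	FLAG(OCFS2_LOCK_INITIALIZED), FLAG(OCFS2_LOCK_FREEING),
	FLAG(OCFS2_LOCK_QUEUED), FLAG(OCFS2_LOCK_NOCACHE),
	FLAG(OCFS2_LOCK_PENDING),
};

/*
 * Hypothetical helper: write the names of all set bits into buf,
 * separated by '|', lowest bit first. Returns buf.
 */
static char *decode_l_flags(unsigned long flags, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
		int n;

		if (!(flags & flag_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", flag_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* buffer too small: stop, output is truncated */
		used += (size_t)n;
	}
	return buf;
}
```

Under those assumed values, decode_l_flags(326, buf, sizeof(buf)) yields the same four flags listed in the analysis, and neither OCFS2_LOCK_ATTACHED nor OCFS2_LOCK_PENDING is set.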

xiaowei.hu at oracle.com

2012-Feb-21 06:12 UTC

[Ocfs2-devel] [PATCH] fixing dlmglue race condition

From: Xiaowei.Hu <xiaowei.hu at oracle.com>

NodeA							NodeB
ocfs2_cluster_lock on a new lockres M
spin_lock_irqsave(&lockres->l_lock, flags);
gen = lockres_set_pending(lockres);
lockres->l_action = OCFS2_AST_ATTACH;
lockres_or_flags(lockres, OCFS2_LOCK_BUSY);
spin_unlock_irqrestore(&lockres->l_lock, flags);

ocfs2_dlm_lock() finished and returned.
**and lockres_clear_pending(lockres, gen, osb);
							requests a lock
							on the same
							lockres M
							It is blocked by
							NodeA, and an
							ast proxy was
							sent to A

bast queued and flushed before the ast was queued;
then ocfs2dc was scheduled.
There is a chance to execute this code path, since
the pending flag was already cleared:
ocfs2_downconvert_thread()
ocfs2_downconvert_thread_do_work()
ocfs2_blocking_ast()
ocfs2_process_blocked_lock()
ocfs2_unblock_lock()
	spin_lock_irqsave(&lockres->l_lock, flags);
	if (lockres->l_flags & OCFS2_LOCK_BUSY)
	    ret = ocfs2_prepare_cancel_convert(osb, lockres);
		BUG_ON(lockres->l_action != OCFS2_AST_CONVERT &&
		       lockres->l_action != OCFS2_AST_DOWNCONVERT);
		the BUG() triggers here
---
 fs/ocfs2/dlmglue.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 81a4cd2..6f5e516 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1471,7 +1471,6 @@ again:
 				     lkm_flags,
 				     lockres->l_name,
 				     OCFS2_LOCK_ID_MAX_LEN - 1);
-		lockres_clear_pending(lockres, gen, osb);
 		if (ret) {
 			if (!(lkm_flags & DLM_LKF_NOQUEUE) ||
 			    (ret != -EAGAIN)) {
-- 
1.7.4.4
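
The interleaving described in the patch can be replayed with a small user-space model. This is a sketch under stated assumptions, not the dlmglue code: all model_* names and the enum are hypothetical, the flag values are illustrative, and the model assumes the downconvert path requeues while the pending flag is set, as the patch description implies. It replays the reported ordering only and does not show whether a bast can actually overtake the ast with o2dlm.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the dlmglue state; values are illustrative. */
enum model_action { AST_INVALID, AST_ATTACH, AST_CONVERT, AST_DOWNCONVERT };

#define LOCK_ATTACHED 0x001u
#define LOCK_BUSY     0x002u
#define LOCK_BLOCKED  0x004u
#define LOCK_PENDING  0x400u

struct model_lockres {
	unsigned int l_flags;
	enum model_action l_action;
};

enum unblock_result {
	UNBLOCK_NOTHING,	/* not blocked, no work */
	UNBLOCK_REQUEUE,	/* busy and pending: try again later */
	UNBLOCK_CANCEL,		/* busy convert: cancel it */
	UNBLOCK_DOWNCONVERT,	/* not busy: safe to downconvert */
	UNBLOCK_WOULD_BUG	/* the BUG_ON() from the report would fire */
};

/* Node A: first half of ocfs2_cluster_lock() on a new lockres. */
static void model_cluster_lock_begin(struct model_lockres *res)
{
	res->l_flags |= LOCK_PENDING;	/* lockres_set_pending() */
	res->l_action = AST_ATTACH;
	res->l_flags |= LOCK_BUSY;
}

/* The call marked with two stars, which the patch removes. */
static void model_clear_pending(struct model_lockres *res)
{
	res->l_flags &= ~LOCK_PENDING;
}

/* The bast triggered by Node B's request marks the lockres blocked. */
static void model_bast(struct model_lockres *res)
{
	res->l_flags |= LOCK_BLOCKED;
}

/* The ast for the attach: clears busy/pending and marks the lock attached. */
static void model_ast(struct model_lockres *res)
{
	res->l_flags &= ~(LOCK_BUSY | LOCK_PENDING);
	res->l_flags |= LOCK_ATTACHED;
	res->l_action = AST_INVALID;
}

/* Decision taken by the downconvert thread, reduced to the checks at issue. */
static enum unblock_result model_unblock_lock(const struct model_lockres *res)
{
	if (!(res->l_flags & LOCK_BLOCKED))
		return UNBLOCK_NOTHING;
	if (res->l_flags & LOCK_BUSY) {
		if (res->l_flags & LOCK_PENDING)
			return UNBLOCK_REQUEUE;
		if (res->l_action != AST_CONVERT &&
		    res->l_action != AST_DOWNCONVERT)
			return UNBLOCK_WOULD_BUG;
		return UNBLOCK_CANCEL;
	}
	return UNBLOCK_DOWNCONVERT;
}

/* Replay: lock request, optional early clear, bast, then downconvert thread. */
static enum unblock_result model_replay(bool clear_pending_early)
{
	struct model_lockres res = { 0, AST_INVALID };

	model_cluster_lock_begin(&res);
	if (clear_pending_early)
		model_clear_pending(&res);
	model_bast(&res);
	return model_unblock_lock(&res);
}
```

In this model, model_replay(true) ends in UNBLOCK_WOULD_BUG, which corresponds to the state seen in the vmcore (busy, blocked, l_action still ATTACH), while model_replay(false) ends in UNBLOCK_REQUEUE, so the downconvert thread backs off until the ast has run.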

Sunil Mushran

2012-Feb-21 17:48 UTC

[Ocfs2-devel] Race condition between OCFS2 downconvert thread and ocfs2 cluster lock.

> bast queued and flushed,before the ast was queued
Unlikely with o2dlm. dlmthread always sends ASTs before BASTs.

Can you recreate the entire lockres? A full dump may yield more
information.

Sunil

Sunil Mushran

2012-Feb-21 18:04 UTC

[Ocfs2-devel] Race condition between OCFS2 downconvert thread and ocfs2 cluster lock.

Moreover, what is lockres_clear_pending doing in 1.4? That code
is not meant for 1.4. It fixes a problem associated with fsdlm.
It was left out of 1.4 for a reason.

This means the bug was introduced by the patch that brought this
call into 1.4.
