xuejiufei
2016-Jan-11 02:46 UTC
[Ocfs2-devel] ocfs2: A race between refmap setting and clearing
Hi all,
We have found a race between refmap setting and clearing which
can cause the lock resource on the master to be freed before other
nodes purge it.

Node 1                               Node 2 (master)
dlm_do_master_request
                                     dlm_master_request_handler
                                     -> dlm_lockres_set_refmap_bit
call dlm_purge_lockres after unlock
                                     dlm_deref_handler
                                     -> find lock resource is in
                                        DLM_LOCK_RES_SETREF_INPROG state,
                                        so dispatch a deref work
dlm_purge_lockres succeed.

dlm_do_master_request
                                     dlm_master_request_handler
                                     -> dlm_lockres_set_refmap_bit

                                     deref work trigger, call
                                     dlm_lockres_clear_refmap_bit
                                     to clear Node 1 from refmap

Now Node 2 can purge the lock resource, but on Node 1 the owner of the
lock resource is still Node 2. This may trigger a BUG if the lock
resource is $RECOVERY, or cause other problems.
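
The interleaving above can be replayed in a small user-space model. This is only a sketch under stated assumptions: every name starting with toy_ is hypothetical and none of it is the o2dlm code. It models the master's refmap as a bitmask and keeps the in-progress flag set for the whole replay, without modelling when the real DLM_LOCK_RES_SETREF_INPROG flag is cleared.

```c
#include <stdbool.h>

#define TOY_NODE1 1

/* Toy view of one lock resource on the master (hypothetical, not o2dlm). */
struct toy_lockres {
    unsigned long refmap;     /* bit n set: node n holds a reference */
    bool setref_inprog;       /* stands in for DLM_LOCK_RES_SETREF_INPROG */
    bool deref_work_pending;  /* a deferred refmap clear is queued */
};

/* Master request: record the requesting node in the refmap. */
static void toy_master_request(struct toy_lockres *res, int node)
{
    res->refmap |= 1UL << node;
    res->setref_inprog = true;
}

/* Deref: if a setref is in progress, defer the clear to a work item. */
static void toy_deref_handler(struct toy_lockres *res, int node)
{
    if (res->setref_inprog) {
        res->deref_work_pending = true;
        return;
    }
    res->refmap &= ~(1UL << node);
}

/* The deferred work: clears the bit whenever it finally runs. */
static void toy_run_deref_work(struct toy_lockres *res, int node)
{
    if (res->deref_work_pending) {
        res->refmap &= ~(1UL << node);
        res->deref_work_pending = false;
    }
}

/*
 * Replays the reported ordering. Returns true when Node 1 holds the
 * lock resource again but the master no longer has its refmap bit.
 */
bool toy_replay_race(void)
{
    struct toy_lockres res = {0};
    bool node1_has_lockres;

    toy_master_request(&res, TOY_NODE1);  /* first master request */
    toy_deref_handler(&res, TOY_NODE1);   /* deref is deferred */
    node1_has_lockres = false;            /* purge succeeds on Node 1 */

    toy_master_request(&res, TOY_NODE1);  /* second master request */
    node1_has_lockres = true;

    toy_run_deref_work(&res, TOY_NODE1);  /* stale work clears the new bit */

    return node1_has_lockres && !(res.refmap & (1UL << TOY_NODE1));
}
```

In this model toy_replay_race() returns true: the stale deferred work removes the bit set by the second master request, which is the state in which the master could purge the lock resource too early.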
We have discussed 2 solutions:
1) The master returns an error to Node 1 if DLM_LOCK_RES_SETREF_INPROG
is set. Node 1 does not retry; instead the master sends another message
to Node 1 after clearing the refmap. Node 1 can purge the lock resource
once the refmap on the master is cleared.
2) The master returns an error to Node 1 if DLM_LOCK_RES_SETREF_INPROG
is set, and Node 1 retries the deref of the lockres.
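
A rough sketch of what solution 2 could look like, again as a user-space model rather than a patch. The names retry_lockres, RETRY_EBUSY, retry_deref_handler and retry_deref are hypothetical, and the clear_after parameter is only a stand-in for whatever step clears the in-progress flag on the master.

```c
#include <stdbool.h>

#define RETRY_EBUSY 1  /* hypothetical "setref in progress, try again" status */

/* Toy view of one lock resource on the master (hypothetical, not o2dlm). */
struct retry_lockres {
    unsigned long refmap;  /* bit n set: node n holds a reference */
    bool setref_inprog;    /* stands in for DLM_LOCK_RES_SETREF_INPROG */
};

/* Master side: refuse the deref while a setref is in progress. */
static int retry_deref_handler(struct retry_lockres *res, int node)
{
    if (res->setref_inprog)
        return RETRY_EBUSY;  /* no deferred work is queued */
    res->refmap &= ~(1UL << node);
    return 0;
}

/*
 * Node 1 side: retry the deref until the master accepts it.
 * The flag is cleared once `clear_after` attempts have failed, to model
 * the setref completing on the master. Returns the number of attempts
 * used, or -1 if `max_tries` was exhausted.
 */
int retry_deref(struct retry_lockres *res, int node,
                int clear_after, int max_tries)
{
    for (int tries = 1; tries <= max_tries; tries++) {
        if (retry_deref_handler(res, node) == 0)
            return tries;
        if (tries >= clear_after)
            res->setref_inprog = false;
    }
    return -1;
}
```

Because the master never queues a deferred clear in this variant, the refmap bit is only dropped by a deref that Node 1 sends while it still intends to purge, so there is no stale work left to run after a later master request.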
Does anybody have better ideas?
Thanks,
--Jiufei
Junxiao Bi
2016-Jan-12 04:03 UTC
[Ocfs2-devel] ocfs2: A race between refmap setting and clearing
Hi Jiufei,

On 01/11/2016 10:46 AM, xuejiufei wrote:
> Hi all,
> We have found a race between refmap setting and clearing which
> will cause the lock resource on master is freed before other nodes
> purge it.
>
> Node 1                               Node 2(master)
> dlm_do_master_request
>                                      dlm_master_request_handler
>                                      -> dlm_lockres_set_refmap_bit
> call dlm_purge_lockres after unlock
>                                      dlm_deref_handler
>                                      -> find lock resource is in
>                                         DLM_LOCK_RES_SETREF_INPROG state,
>                                         so dispatch a deref work
> dlm_purge_lockres succeed.
>
> dlm_do_master_request
>                                      dlm_master_request_handler
>                                      -> dlm_lockres_set_refmap_bit
>
>                                      deref work trigger, call
>                                      dlm_lockres_clear_refmap_bit
>                                      to clear Node 1 from refmap
>
> Now Node 2 can purge the lock resource but the owner of lock resource
> on Node 1 is still Node 2 which may trigger BUG if the lock resource
> is $RECOVERY or other problems.
>
> We have discussed 2 solutions:
> 1)The master return error to Node 1 if the DLM_LOCK_RES_SETREF_INPROG
> is set. Node 1 will not retry and master send another message to Node 1
> after clearing the refmap. Node 1 can purge the lock resource after the
> refmap on master is cleared.
> 2) The master return error to Node 1 if the DLM_LOCK_RES_SETREF_INPROG
> is set, and Node 1 will retry to deref the lockres.
>
> Does anybody has better ideas?
>
dlm_purge_lockres() waits to drop the ref until
DLM_LOCK_RES_SETREF_INPROG is cleared. So what if this flag is set when
the master is found while doing the master request, and then cleared
when the assert master message is received? Can this fix the issue?

Thanks,
Junxiao.
> Thanks,
> --Jiufei
>
Junxiao Bi
2016-Jan-21 07:34 UTC
[Ocfs2-devel] ocfs2: A race between refmap setting and clearing
Hi Jiufei,

I didn't find another solution for this issue. You can go with yours.
Your second one looks more straightforward; with it, the deref work can
be removed.

Thanks,
Junxiao.

On 01/11/2016 10:46 AM, xuejiufei wrote:
> Hi all,
> We have found a race between refmap setting and clearing which
> will cause the lock resource on master is freed before other nodes
> purge it.
>
> Node 1                               Node 2(master)
> dlm_do_master_request
>                                      dlm_master_request_handler
>                                      -> dlm_lockres_set_refmap_bit
> call dlm_purge_lockres after unlock
>                                      dlm_deref_handler
>                                      -> find lock resource is in
>                                         DLM_LOCK_RES_SETREF_INPROG state,
>                                         so dispatch a deref work
> dlm_purge_lockres succeed.
>
> dlm_do_master_request
>                                      dlm_master_request_handler
>                                      -> dlm_lockres_set_refmap_bit
>
>                                      deref work trigger, call
>                                      dlm_lockres_clear_refmap_bit
>                                      to clear Node 1 from refmap
>
> Now Node 2 can purge the lock resource but the owner of lock resource
> on Node 1 is still Node 2 which may trigger BUG if the lock resource
> is $RECOVERY or other problems.
>
> We have discussed 2 solutions:
> 1)The master return error to Node 1 if the DLM_LOCK_RES_SETREF_INPROG
> is set. Node 1 will not retry and master send another message to Node 1
> after clearing the refmap. Node 1 can purge the lock resource after the
> refmap on master is cleared.
> 2) The master return error to Node 1 if the DLM_LOCK_RES_SETREF_INPROG
> is set, and Node 1 will retry to deref the lockres.
>
> Does anybody has better ideas?
>
> Thanks,
> --Jiufei
>