Joseph Qi
2015-Sep-21 09:23 UTC
[Ocfs2-devel] [PATCH v2] ocfs2/dlm: fix race between convert and recovery
Hi Junxiao, This solution may have problem in the following scenario: Consider dlm_send_remote_convert_request has taken too much time, and dlm_move_lockres_to_recovery_list runs first and new master will see this node is currently in convert list after recovery. Then dlm_send_remote_convert_request returns other than DLM_NORMAL and it will revert it to grant list, then retry convert. This will makes this node and master inconsistent. I will try to find another solution to resolve the race issue. On 2015/9/20 15:20, Junxiao Bi wrote:> Reviewed-by: Junxiao Bi <junxiao.bi at oracle.com> > >> > ? 2015?9?18????7:25?Joseph Qi <joseph.qi at huawei.com> ??? >> > >> > There is a race window between dlmconvert_remote and >> > dlm_move_lockres_to_recovery_list, which will cause a lock with >> > OCFS2_LOCK_BUSY in grant list, thus system hangs. >> > >> > dlmconvert_remote >> > { >> > spin_lock(&res->spinlock); >> > list_move_tail(&lock->list, &res->converting); >> > lock->convert_pending = 1; >> > spin_unlock(&res->spinlock); >> > >> > status = dlm_send_remote_convert_request(); >>>>>>>> >>>>>>> race window, master has queued ast and return DLM_NORMAL, >> > and then down before sending ast. >> > this node detects master down and calls >> > dlm_move_lockres_to_recovery_list, which will revert the >> > lock to grant list. >> > Then OCFS2_LOCK_BUSY won't be cleared as new master won't >> > send ast any more because it thinks already be authorized. >> > >> > spin_lock(&res->spinlock); >> > lock->convert_pending = 0; >> > if (status != DLM_NORMAL) >> > dlm_revert_pending_convert(res, lock); >> > spin_unlock(&res->spinlock); >> > } >> > >> > In this case, just leave it in convert list and new master will take care >> > of it after recovery. And if convert request returns other than >> > DLM_NORMAL, convert thread will do the revert itself. So remove the >> > revert logic in dlm_move_lockres_to_recovery_list. >> > >> > changelog since v1: >> > Clean up convert_pending since it is now useless.
Junxiao Bi
2015-Sep-22 00:55 UTC
[Ocfs2-devel] [PATCH v2] ocfs2/dlm: fix race between convert and recovery
On 09/21/2015 05:23 PM, Joseph Qi wrote:> Hi Junxiao, > This solution may have problem in the following scenario: > Consider dlm_send_remote_convert_request has taken too much time, and > dlm_move_lockres_to_recovery_list runs first and new master will see > this node is currently in convert list after recovery. > Then dlm_send_remote_convert_request returns other than DLM_NORMAL and > it will revert it to grant list, then retry convert. This will makes > this node and master inconsistent. > I will try to find another solution to resolve the race issue.If master is down, no need retry convert. May check the return value of dlm_send_remote_convert_request(), if DLM_RECOVERING, don't retry, otherwise retry? Thanks, Junxiao.> > On 2015/9/20 15:20, Junxiao Bi wrote: >> Reviewed-by: Junxiao Bi <junxiao.bi at oracle.com> >> >>>> ? 2015?9?18????7:25?Joseph Qi <joseph.qi at huawei.com> ??? >>>> >>>> There is a race window between dlmconvert_remote and >>>> dlm_move_lockres_to_recovery_list, which will cause a lock with >>>> OCFS2_LOCK_BUSY in grant list, thus system hangs. >>>> >>>> dlmconvert_remote >>>> { >>>> spin_lock(&res->spinlock); >>>> list_move_tail(&lock->list, &res->converting); >>>> lock->convert_pending = 1; >>>> spin_unlock(&res->spinlock); >>>> >>>> status = dlm_send_remote_convert_request(); >>>>>>>>>>>>>>>> race window, master has queued ast and return DLM_NORMAL, >>>> and then down before sending ast. >>>> this node detects master down and calls >>>> dlm_move_lockres_to_recovery_list, which will revert the >>>> lock to grant list. >>>> Then OCFS2_LOCK_BUSY won't be cleared as new master won't >>>> send ast any more because it thinks already be authorized. >>>> >>>> spin_lock(&res->spinlock); >>>> lock->convert_pending = 0; >>>> if (status != DLM_NORMAL) >>>> dlm_revert_pending_convert(res, lock); >>>> spin_unlock(&res->spinlock); >>>> } >>>> >>>> In this case, just leave it in convert list and new master will take care >>>> of it after recovery. And if convert request returns other than >>>> DLM_NORMAL, convert thread will do the revert itself. So remove the >>>> revert logic in dlm_move_lockres_to_recovery_list. >>>> >>>> changelog since v1: >>>> Clean up convert_pending since it is now useless. > >