Hi Joseph,

Our current setup has the details below, and DLM is now set to allow (DLM allow). Do you suggest any other options to get more logs?

debugfs.ocfs2 -l
DLM off ( DLM allow)
MSG off
TCP off
CONN off
VOTE off
DLM_DOMAIN off
HB_BIO off
BASTS off
DLMFS off
ERROR allow
DLM_MASTER off
KTHREAD off
NOTICE allow
QUORUM off
SOCKET off
DLM_GLUE off
DLM_THREAD off
DLM_RECOVERY off
HEARTBEAT off
CLUSTER off

Regards
Prabu

---- On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----

So you mean the four nodes were manually rebooted? If so, you should analyze the messages logged before the reboot.
If there are not enough messages, you can switch on additional message classes. IMO, most hang problems are caused by DLM bugs, so I suggest switching on the DLM-related logs and then reproducing the issue.
You can use debugfs.ocfs2 -l to show all message switches and to switch on the ones you want. For example,
# debugfs.ocfs2 -l DLM allow

Thanks,
Joseph

On 2015/12/22 21:47, gjprabu wrote:
> Hi Joseph,
>
> We are facing OCFS2 server hangs frequently: suddenly 4 nodes go into a hung state, except 1 node. After a reboot everything comes back to normal, and this behavior has happened many times. Do we have any debugging steps or a fix for this issue?
>
> Regards
> Prabu
>
>
> ---- On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
>
> Hi Prabu,
> From the log you provided, I can only see that node 5 disconnected from
> nodes 2, 3, 1 and 4. It seems that something went wrong on those four
> nodes, and node 5 did recovery for them. After that, the four nodes
> joined again.
>
> On 2015/12/22 16:23, gjprabu wrote:
> > Hi,
> >
> > Could anybody please help me with this issue?
> >
> > Regards
> > Prabu
> >
> > ---- On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjprabu at zohocorp.com> wrote ----
> >
> > Dear Team,
> >
> > OCFS2 clients are hanging often and becoming unusable. Please find the logs. Kindly provide a solution; it will be highly appreciated.
> >
> > [3659684.042530] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> >
> > [3992993.101490] (kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = DLM_IVLOCKID
> > [3993002.193285] (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: A895BC216BE641A8A7E20AA89D57E051:M0000000000000062d2dcd000000000: bad lockres name
> > [3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2
> > [3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > [3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > [3993064.860804] o2cb: o2dlm has evicted node 2 from domain A895BC216BE641A8A7E20AA89D57E051
> > [3993073.280062] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 2
> > [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR: A895BC216BE641A8A7E20AA89D57E051: res S000000000000000000000200000000, error -112 send AST to node 4
> > [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: status = -112
> > [3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 3
> > [3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -112 when sending message 514 (key 0xc3460ae7) to node 1
> > [3993094.816118] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -112
> > [3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 3
> > [3993124.779032] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107
> > [3993133.332516] o2cb: o2dlm has evicted node 3 from domain A895BC216BE641A8A7E20AA89D57E051
> > [3993139.915122] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
> > [3993147.071956] o2cb: o2dlm has evicted node 4 from domain A895BC216BE641A8A7E20AA89D57E051
> > [3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 4
> > [3993147.071975] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.071997] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.072001] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.072005] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.072009] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.075019] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107
> > [3993147.075353] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 1 went down!
> > [3993147.075701] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > [3993147.076001] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down!
> > [3993147.076329] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > [3993147.076634] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > [3993147.076968] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > [3993147.077275] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 1
> > [3993147.077591] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 3 up while restarting
> > [3993147.077594] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > [3993155.171570] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down!
> > [3993155.171874] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > [3993155.172150] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > [3993155.172446] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > [3993155.172719] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 3
> > [3993155.173001] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 4 up while restarting
> > [3993155.173003] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > [3993155.173283] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > [3993155.173581] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > [3993155.173858] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 4
> > [3993155.174135] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > [3993155.174458] o2dlm: Node 5 (me) is the Recovery Master for the dead node 2 in domain A895BC216BE641A8A7E20AA89D57E051
> > [3993158.361220] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > [3993158.361228] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> > [3993158.361305] o2dlm: Node 5 (me) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> > [3993161.833543] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > [3993161.833551] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 3
> > [3993161.833620] o2dlm: Node 5 (me) is the Recovery Master for the dead node 3 in domain A895BC216BE641A8A7E20AA89D57E051
> > [3993165.188817] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > [3993165.188826] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 4
> > [3993165.188907] o2dlm: Node 5 (me) is the Recovery Master for the dead node 4 in domain A895BC216BE641A8A7E20AA89D57E051
> > [3993168.551610] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> >
> > [3996486.869628] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 4 5 ) 2 nodes
> > [3996778.703664] o2dlm: Node 4 leaves domain A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes
> > [3997012.295536] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 5 ) 2 nodes
> > [3997099.498157] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 4 5 ) 3 nodes
> > [3997783.633140] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 4 5 ) 4 nodes
> > [3997864.039868] o2dlm: Node 3 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> >
> > Regards
> > Prabu
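For readers of the traces above: the negative numbers are ordinary Linux kernel error codes returned by the messaging layer, i.e. -112 is EHOSTDOWN ("Host is down"), -107 is ENOTCONN ("Transport endpoint is not connected"), and -11 is EAGAIN ("Resource temporarily unavailable"). A small, hedged way to confirm the mapping on a node is shown below (this assumes a Linux host with Python available; the exact strings can vary with the C library):

# python -c 'import os; print(os.strerror(112))'
Host is down
# python -c 'import os; print(os.strerror(107))'
Transport endpoint is not connected
# python -c 'import os; print(os.strerror(11))'
Resource temporarily unavailable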
Please also switch on BASTS and DLM_RECOVERY.

On 2015/12/23 10:11, gjprabu wrote:
> Hi Joseph,
>
> Our current setup has the details below, and DLM is now set to allow (DLM allow). Do you suggest any other options to get more logs?
>
> debugfs.ocfs2 -l
> DLM off ( DLM allow)
> MSG off
> TCP off
> CONN off
> VOTE off
> DLM_DOMAIN off
> HB_BIO off
> BASTS off
> DLMFS off
> ERROR allow
> DLM_MASTER off
> KTHREAD off
> NOTICE allow
> QUORUM off
> SOCKET off
> DLM_GLUE off
> DLM_THREAD off
> DLM_RECOVERY off
> HEARTBEAT off
> CLUSTER off
>
> Regards
> Prabu
>
> ---- On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
>
> So you mean the four nodes were manually rebooted? If so, you should
> analyze the messages logged before the reboot.
> If there are not enough messages, you can switch on additional message classes. IMO,
> most hang problems are caused by DLM bugs, so I suggest switching on the DLM-related
> logs and then reproducing the issue.
> You can use debugfs.ocfs2 -l to show all message switches and to switch on
> the ones you want. For example,
> # debugfs.ocfs2 -l DLM allow
>
> Thanks,
> Joseph
>
> On 2015/12/22 21:47, gjprabu wrote:
> > Hi Joseph,
> >
> > We are facing OCFS2 server hangs frequently: suddenly 4 nodes go into a hung state, except 1 node. After a reboot everything comes back to normal, and this behavior has happened many times. Do we have any debugging steps or a fix for this issue?
> >
> > Regards
> > Prabu
> >
> > ---- On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
> >
> > Hi Prabu,
> > From the log you provided, I can only see that node 5 disconnected from
> > nodes 2, 3, 1 and 4. It seems that something went wrong on those four
> > nodes, and node 5 did recovery for them. After that, the four nodes
> > joined again.
> >
> > On 2015/12/22 16:23, gjprabu wrote:
> > > Hi,
> > >
> > > Could anybody please help me with this issue?
> > >
> > > Regards
> > > Prabu
> > >
> > > ---- On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjprabu at zohocorp.com> wrote ----
> > >
> > > Dear Team,
> > >
> > > OCFS2 clients are hanging often and becoming unusable. Please find the logs. Kindly provide a solution; it will be highly appreciated.
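Joseph's suggestion at the top of the reply above (switching on BASTS and DLM_RECOVERY in addition to DLM) can be applied one switch at a time, in the same form he used for DLM earlier in the thread. A minimal sketch; only the single-switch-per-invocation syntax shown in this thread is assumed here:

# debugfs.ocfs2 -l BASTS allow
# debugfs.ocfs2 -l DLM_RECOVERY allow
# debugfs.ocfs2 -l

Running debugfs.ocfs2 -l with no arguments afterwards, as in Prabu's listing, should show DLM, BASTS and DLM_RECOVERY all at allow before attempting to reproduce the hang.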