Hi Joseph,
I have enabled the requested switches. Will the DLM log now capture enough to
analyze this further? Also, do we need to enable any network-side setting to allow max packets?
debugfs.ocfs2 -l
DLM allow
MSG off
TCP off
CONN off
VOTE off
DLM_DOMAIN off
HB_BIO off
BASTS allow
DLMFS off
ERROR allow
DLM_MASTER off
KTHREAD off
NOTICE allow
QUORUM off
SOCKET off
DLM_GLUE off
DLM_THREAD off
DLM_RECOVERY allow
HEARTBEAT off
CLUSTER off
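
For reference, a rough sketch of how I plan to collect the resulting messages for analysis, assuming the enabled masklog entries end up in the kernel log like the excerpts quoted below (the grep pattern and output file name are just examples):

# dmesg -T | grep -E 'o2dlm|o2cb|dlm_' > /tmp/o2dlm-debug.log
# tail -f /var/log/messages | grep -E 'o2dlm|o2cb|dlm_'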
Regards
Prabu
---- On Wed, 23 Dec 2015 07:51:38 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
Please also switch on BASTS and DLM_RECOVERY.
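(A minimal sketch of the commands, assuming the same debugfs.ocfs2 -l syntax as in my earlier mail quoted below:)
# debugfs.ocfs2 -l BASTS allow
# debugfs.ocfs2 -l DLM_RECOVERY allow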
On 2015/12/23 10:11, gjprabu wrote:
> HI Joseph,
>
> Our current setup has the details below, and DLM logging is now enabled (DLM allow). Do you suggest any other options to get more logs?
>
> debugfs.ocfs2 -l
> DLM off ( DLM allow)
> MSG off
> TCP off
> CONN off
> VOTE off
> DLM_DOMAIN off
> HB_BIO off
> BASTS off
> DLMFS off
> ERROR allow
> DLM_MASTER off
> KTHREAD off
> NOTICE allow
> QUORUM off
> SOCKET off
> DLM_GLUE off
> DLM_THREAD off
> DLM_RECOVERY off
> HEARTBEAT off
> CLUSTER off
>
> Regards
> Prabu
>
>
>
> ---- On Wed, 23 Dec 2015 07:30:54 +0530 *Joseph Qi <joseph.qi at huawei.com>* wrote ----
>
> So you mean the four nodes were manually rebooted? If so, you must
> analyze the messages logged before you rebooted.
> If there are not enough messages, you can switch on more of them. IMO,
> most hang problems are caused by DLM bugs, so I suggest switching on the
> DLM-related logs and reproducing the issue.
> You can use debugfs.ocfs2 -l to show all message switches and switch on
> the ones you want. For example,
> # debugfs.ocfs2 -l DLM allow
>
> Thanks,
> Joseph
>
> On 2015/12/22 21:47, gjprabu wrote:
> > Hi Joseph,
> >
> > We are frequently facing an OCFS2 server hang problem: 4 nodes suddenly go into a hung state, and only 1 node stays up. After a reboot everything comes back to normal; this has happened many times. Do we have any way to debug and fix this issue?
> >
> > Regards
> > Prabu
> >
> >
> > ---- On Tue, 22 Dec 2015 16:30:52 +0530 *Joseph Qi <joseph.qi at huawei.com>* wrote ----
> >
> > Hi Prabu,
> > From the log you provided, I can only see that node 5 disconnected from
> > nodes 2, 3, 1 and 4. It seems that something went wrong on those four
> > nodes, and node 5 did recovery for them. After that, the four nodes
> > joined again.
> >
> > On 2015/12/22 16:23, gjprabu wrote:
> > > Hi,
> > >
> > > Could anybody please help me with this issue?
> > >
> > > Regards
> > > Prabu
> > >
> > > ---- On Mon, 21 Dec 2015 15:16:49 +0530 *gjprabu <gjprabu at zohocorp.com>* wrote ----
> > >
> > > Dear Team,
> > >
> > > OCFS2 clients are hanging often and becoming unusable. Please find the logs below. Kindly provide a solution; it will be highly appreciated.
> > >
> > >
> > > [3659684.042530] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> > >
> > > [3992993.101490] (kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = DLM_IVLOCKID
> > > [3993002.193285] (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: A895BC216BE641A8A7E20AA89D57E051:M0000000000000062d2dcd000000000: bad lockres name
> > > [3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2
> > > [3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > > [3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > > [3993064.860804] o2cb: o2dlm has evicted node 2 from domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993073.280062] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 2
> > > [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR: A895BC216BE641A8A7E20AA89D57E051: res S000000000000000000000200000000, error -112 send AST to node 4
> > > [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: status = -112
> > > [3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 3
> > > [3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -112 when sending message 514 (key 0xc3460ae7) to node 1
> > > [3993094.816118] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -112
> > > [3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 3
> > > [3993124.779032] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107
> > > [3993133.332516] o2cb: o2dlm has evicted node 3 from domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993139.915122] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993147.071956] o2cb: o2dlm has evicted node 4 from domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 4
> > > [3993147.071975] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > [3993147.071997] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > [3993147.072001] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > [3993147.072005] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > [3993147.072009] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > [3993147.075019] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107
> > > [3993147.075353] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 1 went down!
> > > [3993147.075701] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > [3993147.076001] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down!
> > > [3993147.076329] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > [3993147.076634] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > [3993147.076968] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > [3993147.077275] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 1
> > > [3993147.077591] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 3 up while restarting
> > > [3993147.077594] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > [3993155.171570] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down!
> > > [3993155.171874] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > [3993155.172150] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > [3993155.172446] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > [3993155.172719] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 3
> > > [3993155.173001] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 4 up while restarting
> > > [3993155.173003] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > [3993155.173283] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > [3993155.173581] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > [3993155.173858] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 4
> > > [3993155.174135] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > [3993155.174458] o2dlm: Node 5 (me) is the Recovery Master for the dead node 2 in domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993158.361220] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993158.361228] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> > > [3993158.361305] o2dlm: Node 5 (me) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993161.833543] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993161.833551] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 3
> > > [3993161.833620] o2dlm: Node 5 (me) is the Recovery Master for the dead node 3 in domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993165.188817] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993165.188826] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 4
> > > [3993165.188907] o2dlm: Node 5 (me) is the Recovery Master for the dead node 4 in domain A895BC216BE641A8A7E20AA89D57E051
> > > [3993168.551610] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > >
> > > [3996486.869628] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 4 5 ) 2 nodes
> > > [3996778.703664] o2dlm: Node 4 leaves domain A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes
> > > [3997012.295536] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 5 ) 2 nodes
> > > [3997099.498157] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 4 5 ) 3 nodes
> > > [3997783.633140] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 4 5 ) 4 nodes
> > > [3997864.039868] o2dlm: Node 3 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> > >
> > > Regards
> > > Prabu
> > >
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > Ocfs2-users mailing list
> > > Ocfs2-users at oss.oracle.com
> > > https://oss.oracle.com/mailman/listinfo/ocfs2-users
> > >
> >
> >
> >
>
>
>