Hi Prabu,
[193918.928968] (kworker/u128:1,51132,30):dlm_do_assert_master:1717 ERROR: Error
-112 when sending message 502 (key 0xc3460ae7) to node 1
[193918.929004] (kworker/u128:3,63088,31):dlm_send_remote_convert_request:392
ERROR: Error -112 when sending message 504 (key 0xc3460ae7) to node 1
The above error messages show that the link between this node and node
1 is down, so this node cannot send DLM messages to it.
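For reference, on Linux errno 112 is EHOSTDOWN and 107 is ENOTCONN (the -107
errors appear later in this thread). A small Python sketch to decode such
negative kernel return codes; the `decode` helper is just for illustration:

```python
import errno
import os

def decode(ret):
    """Map a negative kernel return code to its errno name and message."""
    n = abs(ret)
    return errno.errorcode.get(n, "UNKNOWN"), os.strerror(n)

# The codes seen in the log above (names/messages are Linux-specific):
print(decode(-112))  # ('EHOSTDOWN', 'Host is down')
print(decode(-107))  # ('ENOTCONN', 'Transport endpoint is not connected')
```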
On 2015/9/29 19:52, gjprabu wrote:
> Hi Joseph,
>
> For our own testing we rebooted Node1 and Node7; these are the logs it
shows. I have cross-checked the configuration in /etc/ocfs2/cluster.conf and it
is fine. Can anybody help with this issue? I believe this issue is related to
OCFS2, not to RBD.
>
> /sys/kernel/config/cluster/ocfs2/node/
> [root at integ-cm2 node]# ls
> integ-ci-1 integ-cm1 integ-cm2 integ-hm2 integ-hm5 integ-hm8
integ-hm9
>
> Also, please find the logs that were previously missed.
>
> [ 475.407086] o2dlm: Joining domain A895BC216BE641A8A7E20AA89D57E051 ( 1 3
4 7 ) 4 nodes
> [ 880.734421] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [ 892.746728] o2dlm: Node 2 leaves domain A895BC216BE641A8A7E20AA89D57E051
( 1 3 4 7 ) 4 nodes
> [ 905.264066] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [12313.418294] o2cb: o2dlm has evicted node 1 from domain
A895BC216BE641A8A7E20AA89D57E051
> [12315.042208] o2cb: o2dlm has evicted node 1 from domain
A895BC216BE641A8A7E20AA89D57E051
> [12315.402103] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [12315.402111] o2dlm: Node 4 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [12315.402114] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [12320.402074] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [12320.402080] o2dlm: Node 4 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [12320.402083] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [12698.830376] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [181348.383986] o2cb: o2dlm has evicted node 7 from domain
A895BC216BE641A8A7E20AA89D57E051
> [181349.048120] o2cb: o2dlm has evicted node 7 from domain
A895BC216BE641A8A7E20AA89D57E051
> [181351.972048] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 7
> [181351.972056] o2dlm: Node 1 (he) is the Recovery Master for the dead node
7 in domain A895BC216BE641A8A7E20AA89D57E051
> [181351.972059] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [181356.972035] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 7
> [181356.972040] o2dlm: Node 1 (he) is the Recovery Master for the dead node
7 in domain A895BC216BE641A8A7E20AA89D57E051
> [181356.972042] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [181361.972046] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 7
> [181361.972054] o2dlm: Node 1 (he) is the Recovery Master for the dead node
7 in domain A895BC216BE641A8A7E20AA89D57E051
> [181361.972057] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [181366.972049] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 7
> [181366.972056] o2dlm: Node 1 (he) is the Recovery Master for the dead node
7 in domain A895BC216BE641A8A7E20AA89D57E051
> [181366.972059] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [181599.543509] o2dlm: Node 7 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [183251.706097] o2dlm: Node 7 leaves domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 ) 4 nodes
> [183462.532465] o2dlm: Node 7 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [183506.924225] o2dlm: Node 7 leaves domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 ) 4 nodes
> [183709.344072] o2dlm: Node 7 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [183905.441289] o2dlm: Node 7 leaves domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 ) 4 nodes
> [184103.391770] o2dlm: Node 7 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [184175.702196] o2dlm: Node 7 leaves domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 ) 4 nodes
> [184363.166986] o2dlm: Node 7 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [193918.928968] (kworker/u128:1,51132,30):dlm_do_assert_master:1717 ERROR:
Error -112 when sending message 502 (key 0xc3460ae7) to node 1
> [193918.929004]
(kworker/u128:3,63088,31):dlm_send_remote_convert_request:392 ERROR: Error -112
when sending message 504 (key 0xc3460ae7) to node 1
> [193918.929035] o2dlm: Waiting on the death of node 1 in domain
A895BC216BE641A8A7E20AA89D57E051
> [193918.929083] o2cb: o2dlm has evicted node 1 from domain
A895BC216BE641A8A7E20AA89D57E051
> [193920.386365] o2cb: o2dlm has evicted node 1 from domain
A895BC216BE641A8A7E20AA89D57E051
> [193921.972105] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193921.972114] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193921.972116] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193926.972056] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193926.972063] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193926.972066] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193931.972054] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193931.972062] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193931.972065] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193936.972101] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193936.972108] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193936.972110] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193941.972066] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193941.972072] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193941.972075] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193946.972077] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193946.972084] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193946.972086] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193951.972107] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193951.972114] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193951.972116] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193956.972073] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193956.972081] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193956.972084] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193961.972075] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193961.972082] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193961.972084] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193966.972051] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193966.972059] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193966.972062] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193971.972115] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193971.972122] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193971.972124] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [193976.972103] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [193976.972111] o2dlm: Node 2 (he) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [193976.972114] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [194143.962241] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [199847.473092] o2dlm: Node 7 leaves domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 ) 4 nodes
> [208215.106305] o2dlm: Node 7 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [258418.054204] o2cb: o2dlm has evicted node 7 from domain
A895BC216BE641A8A7E20AA89D57E051
> [258418.957738] o2cb: o2dlm has evicted node 7 from domain
A895BC216BE641A8A7E20AA89D57E051
> [264056.408719] o2dlm: Node 7 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [264464.605542] o2dlm: Node 7 leaves domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 ) 4 nodes
> [275619.497198] o2dlm: Node 7 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
> [426628.076148] o2cb: o2dlm has evicted node 1 from domain
A895BC216BE641A8A7E20AA89D57E051
> [426628.885084] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> [426628.891170] o2dlm: Node 3 (me) is the Recovery Master for the dead node
1 in domain A895BC216BE641A8A7E20AA89D57E051
> [426634.182384] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> [427001.383315] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051
( 1 2 3 4 7 ) 5 nodes
>
> Regards
> Prabu
>
> ---- On Tue, 29 Sep 2015 15:01:40 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
>
>
>
> On 2015/9/29 15:18, gjprabu wrote:
> > Hi Joseph,
> >
> > We have 7 nodes in total, and this problem occurs on multiple
nodes simultaneously, not on one particular node. We checked the network and it
is fine.
> > When we remount the ocfs2 partition, the problem is fixed
temporarily, but the same problem reoccurs after some time.
> >
> > We also have a problem while unmounting: the umount process goes
into "D" state, and then I need to restart the server itself. Is there any
solution for this issue?
> >
> > I have tried running fsck.ocfs2 on the problematic machine, but it
throws an error.
> >
> > fsck.ocfs2 1.8.0
> > fsck.ocfs2: I/O error on channel while opening
"/zoho/build/downloads"
> >
> IMO, this can happen if the mountpoint is offline.
>
> >
> > Please refer the latest logs from one node.
> >
> > [258418.054204] o2cb: o2dlm has evicted node 7 from domain
A895BC216BE641A8A7E20AA89D57E051
> > [258418.957738] o2cb: o2dlm has evicted node 7 from domain
A895BC216BE641A8A7E20AA89D57E051
> > [264056.408719] o2dlm: Node 7 joins domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 7 ) 5 nodes
> > [264464.605542] o2dlm: Node 7 leaves domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 ) 4 nodes
> > [275619.497198] o2dlm: Node 7 joins domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 7 ) 5 nodes
> > [426628.076148] o2cb: o2dlm has evicted node 1 from domain
A895BC216BE641A8A7E20AA89D57E051
> > [426628.885084] o2dlm: Begin recovery on domain
A895BC216BE641A8A7E20AA89D57E051 for node 1
> > [426628.891170] o2dlm: Node 3 (me) is the Recovery Master for
the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> > [426634.182384] o2dlm: End recovery on domain
A895BC216BE641A8A7E20AA89D57E051
> > [427001.383315] o2dlm: Node 1 joins domain
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 7 ) 5 nodes
> >
> The above messages show that nodes in your cluster are frequently
joining and leaving.
> I suggest you check the cluster config on each node
> (/etc/ocfs2/cluster.conf as well as
/sys/kernel/config/cluster/<cluster_name>/node/).
> I have not used ocfs2 together with ceph rbd, so I am not sure whether
it is related to rbd.
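As a quick way to do that config check, a minimal sanity pass over
cluster.conf could look like the sketch below. This is illustration only:
`check_cluster_conf` is a hypothetical helper, and the real o2cb parser is
stricter about formatting. It flags a declared node_count that does not match
the number of node stanzas, and duplicate node numbers.

```python
def check_cluster_conf(text):
    """Sanity-check an o2cb cluster.conf: compare the declared node_count
    against the actual node stanzas, and report duplicate node numbers."""
    declared = None
    numbers = []
    stanza = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "node:":
            stanza = "node"
        elif line == "cluster:":
            stanza = "cluster"
        elif "=" in line:
            key, _, val = (p.strip() for p in line.partition("="))
            if stanza == "cluster" and key == "node_count":
                declared = int(val)
            elif stanza == "node" and key == "number":
                numbers.append(int(val))
    return {
        "declared": declared,
        "actual": len(numbers),
        "duplicates": sorted(n for n in set(numbers) if numbers.count(n) > 1),
    }
```

Running this on the same file from every node (and diffing the files
themselves) should confirm whether all nodes really agree on the membership.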
>
> >
> >
> >
> > Regards
> > G.J
> >
> >
> >
> > ---- On Fri, 25 Sep 2015 06:26:57 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
> >
> > On 2015/9/24 18:30, gjprabu wrote:
> > > Hi All,
> > >
> > > Can someone tell me what kind of issue this is?
> > >
> > > Regards
> > > Prabu GJ
> > >
> > >
> > > ---- On Wed, 23 Sep 2015 18:26:13 +0530 gjprabu <gjprabu at zohocorp.com> wrote ----
> > >
> > > Hi All,
> > >
> > > We faced this issue on a local machine as well, but not on all the
clients; only two ocfs2 clients are facing this issue.
> > >
> > > Regards
> > > Prabu GJ
> > >
> > >
> > >
> > > ---- On Wed, 23 Sep 2015 17:49:51 +0530 gjprabu <gjprabu at zohocorp.com> wrote ----
> > >
> > >
> > >
> > > Hi All,
> > >
> > > We are using ocfs2 on an RBD mount and everything works fine, but
while writing or moving data via scripts, after the write it shows the error
below. Please, can anybody help with this issue?
> > >
> > >
> > >
> > > # ls -althr
> > > ls: cannot access MICKEYLITE_3_0_M4_1_TEST: Input/output
error
> > > ls: cannot access MICKEYLITE_3_0_M4_1_OLD: Input/output
error
> > > total 0
> > > d????????? ? ? ? ? ? MICKEYLITE_3_0_M4_1_TEST
> > > d????????? ? ? ? ? ? MICKEYLITE_3_0_M4_1_OLD
> > >
> > > Partition details:
> > >
> > > /dev/rbd0 ocfs2 9.6T 140G 9.5T 2% /zoho/build/downloads
> > >
> > > /etc/ocfs2/cluster.conf
> > > cluster:
> > > node_count=7
> > > heartbeat_mode = local
> > > name=ocfs2
> > >
> > > node:
> > > ip_port = 7777
> > > ip_address = 10.1.1.50
> > > number = 1
> > > name = integ-hm5
> > > cluster = ocfs2
> > >
> > > node:
> > > ip_port = 7777
> > > ip_address = 10.1.1.51
> > > number = 2
> > > name = integ-hm9
> > > cluster = ocfs2
> > >
> > > node:
> > > ip_port = 7777
> > > ip_address = 10.1.1.52
> > > number = 3
> > > name = integ-hm2
> > > cluster = ocfs2
> > >
> > > node:
> > > ip_port = 7777
> > > ip_address = 10.1.1.53
> > > number = 4
> > > name = integ-ci-1
> > > cluster = ocfs2
> > > node:
> > > ip_port = 7777
> > > ip_address = 10.1.1.54
> > > number = 5
> > > name = integ-cm2
> > > cluster = ocfs2
> > > node:
> > > ip_port = 7777
> > > ip_address = 10.1.1.55
> > > number = 6
> > > name = integ-cm1
> > > cluster = ocfs2
> > > node:
> > > ip_port = 7777
> > > ip_address = 10.1.1.56
> > > number = 7
> > > name = integ-hm8
> > > cluster = ocfs2
> > >
> > >
> > > Error on dmesg:
> > >
> > >
> > > [516421.342393] (dlm_thread,51005,25):dlm_flush_asts:599
ERROR: status = -112
> > > [517119.689992]
(httpd,64399,31):dlm_do_master_request:1383 ERROR: link to 1 went down!
> > > [517119.690003]
(dlm_thread,51005,25):dlm_send_proxy_ast_msg:482 ERROR:
A895BC216BE641A8A7E20AA89D57E051: res S000000000000000000000200000000, error
-112 send AST to node 1
> > > [517119.690028] (dlm_thread,51005,25):dlm_flush_asts:599
ERROR: status = -112
> > > [517119.690034]
(dlm_thread,51005,25):dlm_send_proxy_ast_msg:482 ERROR:
A895BC216BE641A8A7E20AA89D57E051: res S000000000000000000000200000000, error
-107 send AST to node 1
> > > [517119.690036] (dlm_thread,51005,25):dlm_flush_asts:599
ERROR: status = -107
> > > [517119.700425]
(httpd,64399,31):dlm_get_lock_resource:968 ERROR: status = -112
> > > [517517.894949]
(dlm_thread,51005,25):dlm_send_proxy_ast_msg:482 ERROR:
A895BC216BE641A8A7E20AA89D57E051: res S000000000000000000000200000000, error
-112 send AST to node 1
> > > [517517.899640] (dlm_thread,51005,25):dlm_flush_asts:599
ERROR: status = -112
> > >
> > These error messages mean that the connection between this node and
node 1 has a problem.
> > You have to check the network.
> >
> > >
> > > Regards
> > > Prabu GJ
> > >
> > >
> > >
> > > _______________________________________________
> > > Ocfs2-users mailing list
> > > Ocfs2-users at oss.oracle.com
> > > https://oss.oracle.com/mailman/listinfo/ocfs2-users
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
>
>
>