I have a two-node ocfs2 cluster, and in the /etc/ocfs2/cluster.conf file, node_count is 0 rather than 2. Is this necessarily a wrong config, and how would it affect the cluster?

Thanks.
Hai Tao
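For anyone comparing notes, a minimal way to see what the running cluster actually registered, as opposed to what the file says, is to look at the o2cb configfs tree. This is only a sketch: the paths assume the default configfs mount point and a cluster named "ocfs2"; adjust both to your setup.

    # node_count as written in the config file
    grep -A 2 '^cluster:' /etc/ocfs2/cluster.conf

    # nodes the running o2cb cluster has actually registered, and their numbers
    ls /sys/kernel/config/cluster/ocfs2/node/
    cat /sys/kernel/config/cluster/ocfs2/node/*/num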
I have a two-node ocfs2 cluster, and I disabled the heartbeat nic with "ifdown eth1". I got the following weird logs on both nodes:

Sep 7 10:45:49 dbtest-01 kernel: o2net: connection to node dbtest-02 (num 1) at 10.194.59.65:7777 has been idle for 30.0 seconds, shutting it down.
Sep 7 10:45:49 dbtest-01 kernel: (swapper,0,3):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1315417519.185025 now 1315417549.183798 dr 1315417519.185016 adv 1315417519.185032:1315417519.185032 func (b9bb7168:504) 1315417518.872227:1315417518.872268)
Sep 7 10:45:49 dbtest-01 kernel: o2net: no longer connected to node dbtest-02 (num 1) at 10.194.59.65:7777
Sep 7 10:45:49 dbtest-01 kernel: (dlm_thread,3781,2):dlm_send_proxy_ast_msg:457 ERROR: status = -112
Sep 7 10:45:49 dbtest-01 kernel: (oracle,26129,1):dlm_do_master_request:1334 ERROR: link to 1 went down!
Sep 7 10:45:49 dbtest-01 kernel: (oracle,26129,1):dlm_get_lock_resource:917 ERROR: status = -112
Sep 7 10:45:49 dbtest-01 kernel: (dlm_thread,4256,1):dlm_send_proxy_ast_msg:457 ERROR: status = -112
Sep 7 10:45:49 dbtest-01 kernel: (dlm_thread,4256,1):dlm_flush_asts:604 ERROR: status = -112
Sep 7 10:45:49 dbtest-01 kernel: (dlm_thread,3781,2):dlm_flush_asts:604 ERROR: status = -112
Sep 7 10:46:19 dbtest-01 kernel: (o2net,3736,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Sep 7 10:46:19 dbtest-01 kernel: o2net: accepted connection from node dbtest-02 (num 1) at 10.194.59.65:7777
Sep 7 10:48:37 dbtest-01 kernel: INFO: task events/0:10 blocked for more than 120 seconds.
Sep 7 10:48:37 dbtest-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 7 10:48:37 dbtest-01 kernel: events/0 D ffff810001004420 0 10 1 11 9 (L-TLB)
Sep 7 10:48:37 dbtest-01 kernel: ffff81083ffedc80 0000000000000046 ffffffff80333680 0000000000000001
Sep 7 10:48:37 dbtest-01 kernel: 0000000000000400 000000000000000a ffff81083ffe1820 ffffffff80309b60
Sep 7 10:48:37 dbtest-01 kernel: 0030b62498ce7b3f 000000000000416b ffff81083ffe1a08 0000000000000000
Sep 7 10:48:37 dbtest-01 kernel: Call Trace:
Sep 7 10:48:37 dbtest-01 kernel: Call Trace:
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff80064167>] wait_for_completion+0x79/0xa2
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff884e64b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff884e78d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff885013e5>] :ocfs2:ocfs2_orphan_scan_work+0x0/0x83
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff884ed1e4>] :ocfs2:ocfs2_orphan_scan_lock+0x55/0x84
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff884fc59b>] :ocfs2:ocfs2_queue_orphan_scan+0x32/0x147
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff885013ff>] :ocfs2:ocfs2_orphan_scan_work+0x1a/0x83
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8004dc37>] run_workqueue+0x94/0xe4
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8004a472>] worker_thread+0x0/0x122
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8004a562>] worker_thread+0xf0/0x122
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff80032bdc>] kthread+0xfe/0x132
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8005efb1>] child_rip+0xa/0x11
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff80032ade>] kthread+0x0/0x132
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8005efa7>] child_rip+0x0/0x11
Does anyone know why this happened?

Thanks.
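A small aside that may help others reading this trace: the repeated "status = -112" in the dlm messages is a negated Linux errno value. A quick way to decode it from a shell (any box with Python will do):

    # errno 112 on Linux is EHOSTDOWN ("Host is down"), i.e. the peer on the
    # cluster interconnect is considered unreachable -- consistent with the
    # heartbeat nic having been ifdown'ed
    python -c 'import errno, os; print(errno.errorcode[112]); print(os.strerror(112))'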
Is the ocfs2 heartbeat transferred over the network, or is it just updating a file on the shared disk? If the heartbeat is lost, what should happen? And what if only one node is writing while the other is idle -- will that still cause any file system issue?

Thanks.
Hai Tao

From: taoh666 at hotmail.com
To: ocfs2-users at oss.oracle.com
Date: Sat, 10 Sep 2011 00:50:23 -0700
Subject: [Ocfs2-users] disable heartbeat nic caused ocfs2 errors

I have a two-node ocfs2 cluster, and I disabled the heartbeat nic with "ifdown eth1". I got the following weird logs on both nodes:

[...]

Does anyone know why this happened?

Thanks.
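One way to see the disk side of the heartbeat for yourself: each mounted ocfs2 volume registers a heartbeat region with o2cb, and both the region and the nodes using the volume can be listed from userspace. This is only a sketch; the configfs path assumes a cluster named "ocfs2", and the exact output of the tools varies by ocfs2-tools version.

    # list ocfs2 devices and the nodes currently mounting each one
    mounted.ocfs2 -f

    # heartbeat regions (one per mounted volume, keyed by UUID) registered
    # with the running cluster
    ls /sys/kernel/config/cluster/ocfs2/heartbeat/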
Sunil Mushran
2011-Sep-12 18:01 UTC
[Ocfs2-users] disable heartbeat nic caused ocfs2 errors
ocfs2 uses disk heartbeat to detect node liveness. It uses net heartbeat to detect link liveness. Both need to operate for the cluster to function. If the network link between two nodes snaps, then one of the two nodes is fenced. The stack below indicates that the two nodes are not able to communicate. The two nodes are waiting on the quorum to fence one of the nodes. It appears you have upped the disk heartbeat timeout to more than 2 minutes. I would imagine one of the nodes reset after that timeout.

On 09/10/2011 08:54 PM, Hai Tao wrote:
> Is the ocfs2 heartbeat transferred over the network, or is it just updating a file on the shared disk?
>
> If the heartbeat is lost, what should happen? And what if only one node is writing while the other is idle -- will that still cause any file system issue?
>
> Thanks.
> Hai Tao
>
> From: taoh666 at hotmail.com
> To: ocfs2-users at oss.oracle.com
> Date: Sat, 10 Sep 2011 00:50:23 -0700
> Subject: [Ocfs2-users] disable heartbeat nic caused ocfs2 errors
>
> I have a two-node ocfs2 cluster, and I disabled the heartbeat nic with "ifdown eth1". I got the following weird logs on both nodes:
>
> [...]
>
> Does anyone know why this happened?
>
> Thanks.
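For anyone wanting to verify the timeouts Sunil mentions, a minimal sketch follows. The /etc/sysconfig path is the EL-style layout (Debian-based installs use /etc/default/o2cb), and the configfs paths assume a cluster named "ocfs2"; adjust to your setup.

    # timeouts as configured for the o2cb init script
    grep -E 'O2CB_(HEARTBEAT_THRESHOLD|IDLE_TIMEOUT_MS)' /etc/sysconfig/o2cb

    # values the running cluster is actually using
    cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold   # disk heartbeat threshold
    cat /sys/kernel/config/cluster/ocfs2/idle_timeout_ms            # network idle timeout (ms)

The effective disk-heartbeat timeout is roughly (dead_threshold - 1) * 2 seconds, so the common default of 31 gives about 60 seconds; a node that hangs for well over 2 minutes before being fenced suggests a threshold raised somewhere above 61.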
It is a wrong config.

On 09/09/2011 10:15 PM, Hai Tao wrote:
> I have a two-node ocfs2 cluster, and in the /etc/ocfs2/cluster.conf file, node_count is 0 rather than 2. Is this necessarily a wrong config, and how would it affect the cluster?
>
> Thanks.
> Hai Tao
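For reference, a corrected two-node cluster.conf carries node_count = 2 in the cluster stanza. The sketch below reuses the node names and the one address visible in this thread; the cluster name "ocfs2", the node numbers, and the dbtest-01 address are assumptions to replace with your own values. Keep in mind the file is whitespace-sensitive (stanza headers flush left, attributes indented) and must be identical on both nodes.

    node:
            ip_port = 7777
            ip_address = <ip of dbtest-01>
            number = 0
            name = dbtest-01
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.194.59.65
            number = 1
            name = dbtest-02
            cluster = ocfs2

    cluster:
            node_count = 2
            name = ocfs2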