Thiruselvam Velayutham
2014-Jan-21 07:05 UTC
[Ocfs2-users] OCFS2 issues during high availability testing
Dear Experts,

Our DBA team is facing the following problem. During high-availability testing, when we crashed DB node 1, DB node 2 also went down. From the errors, I can see that the OCFS2 service shut down DB02.

Here is the issue in detail. The servers are:

DB01
DB02
AP01
AP02

When I crash the DB01 server, the DB02 server also goes down and Oracle collapses entirely. In the reverse case, when I crash DB02, DB01 survives and Oracle continues to work without any issues.

messages_DB02.txt
==================
Jan 20 13:15:52 kbmmoppdb02 avahi-daemon[8824]: Registering new address record for 172.20.1.9 on eth0.
Jan 20 13:16:13 kbmmoppdb02 kernel: o2dlm: Node 0 leaves domain 8155F09482C94D3AB99D0669B91C0B1E
Jan 20 13:16:13 kbmmoppdb02 kernel: o2dlm: Nodes in domain 8155F09482C94D3AB99D0669B91C0B1E: 1
Jan 20 13:17:27 kbmmoppdb02 kernel: o2net: connection to node kbmmoppdb01 (num 0) at 10.255.255.3:7777 has been idle for 30.0 seconds, shutting it down.
Jan 20 13:17:27 kbmmoppdb02 kernel: (swapper,0,11):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1390245417.409760 now 1390245447.410787 dr 1390245417.409740 adv 1390245417.409769:1390245417.409770 func (d9d367e5:505) 1390245414.653885:1390245414.653892)
Jan 20 13:17:27 kbmmoppdb02 kernel: o2net: no longer connected to node kbmmoppdb01 (num 0) at 10.255.255.3:7777
Jan 20 13:17:27 kbmmoppdb02 kernel: (kswapd0,576,10):dlm_send_remote_unlock_request:360 ERROR: Error -112 when sending message 506 (key 0x60f827ee) to node 0
Jan 20 13:17:48 kbmmoppdb02 kernel: o2net: connection to node kbmmoppdb01 (num 0) at 10.255.255.3:7777 shutdown, state 7
Jan 20 13:17:57 kbmmoppdb02 kernel: (o2net,6123,11):o2net_connect_expired:1676 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Jan 20 13:17:57 kbmmoppdb02 kernel: (dlm_thread,6161,8):dlm_drop_lockres_ref:2191 ERROR: Error -107 when sending message 507 (key 0x60f827ee) to node 0
Jan 20 13:17:57 kbmmoppdb02 kernel: (kswapd0,576,10):dlm_send_remote_unlock_request:360 ERROR: Error -107 when sending message 506 (key 0x60f827ee) to node 0
Jan 20 13:17:57 kbmmoppdb02 last message repeated 73 times
Jan 20 13:17:57 kbmmoppdb02 kernel: (dlm_thread,6161,8):dlm_purge_lockres:193 ERROR: C5F98815D0BF43578B48C12C21114311: deref O000000000000000124facd00000000 failed -107
Jan 20 13:17:57 kbmmoppdb02 kernel: o2net: connection to node kbmmoppdb01 (num 0) at 10.255.255.3:7777 shutdown, state 7
Jan 20 13:18:25 kbmmoppdb02 last message repeated 9 times
Jan 20 13:18:27 kbmmoppdb02 kernel: (o2net,6123,11):o2net_connect_expired:1676 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Jan 20 13:18:27 kbmmoppdb02 kernel: (dlm_thread,6161,8):dlm_drop_lockres_ref:2191 ERROR: Error -107 when sending message 507 (key 0x60f827ee) to node 0
Jan 20 13:18:27 kbmmoppdb02 kernel: (kswapd0,576,10):dlm_send_remote_unlock_request:360 ERROR: Error -107 when sending message 506 (key 0x60f827ee) to node 0
Jan 20 13:18:27 kbmmoppdb02 last message repeated 180 times
Jan 20 13:18:27 kbmmoppdb02 kernel: (dlm_thread,6161,8):dlm_purge_lockres:193 ERROR: C5F98815D0BF43578B48C12C21114311: deref M000000000000000124facd00000000 failed -107
Jan 20 13:18:27 kbmmoppdb02 kernel: (dlm_thread,6161,10):dlm_drop_lockres_ref:2191 ERROR: Error -107 when sending message 507 (key 0x60f827ee) to node 0
Jan 20 13:18:27 kbmmoppdb02 kernel: (dlm_thread,6161,10):dlm_purge_lockres:193 ERROR: C5F98815D0BF43578B48C12C21114311: deref O000000000000000124facc00000000 failed -107
Jan 20 13:18:27 kbmmoppdb02 kernel: (dlm_thread,6161,4):dlm_drop_lockres_ref:2191 ERROR: Error -107 when sending message 507 (key 0x60f827ee) to node 0
Jan 20 13:18:27 kbmmoppdb02 kernel: (dlm_thread,6161,4):dlm_purge_lockres:193 ERROR: C5F98815D0BF43578B48C12C21114311: deref O000000000000000124fa8e00000000 failed -107
Jan 20 13:18:28 kbmmoppdb02 kernel: o2net: connection to node kbmmoppdb01 (num 0) at 10.255.255.3:7777 shutdown, state 7
Jan 20 13:18:31 kbmmoppdb02 kernel: o2net: connection to node kbmmoppdb01 (num 0) at 10.255.255.3:7777 shutdown, state 7
Jan 20 13:18:33 kbmmoppdb02 kernel: (events/11,49,11):o2quo_make_decision:158 ERROR: fencing this node because it is connected to a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0
Jan 20 13:18:33 kbmmoppdb02 kernel: (events/11,49,11):o2hb_stop_all_regions:2026 ERROR: stopping heartbeat on all active regions.
Jan 20 13:23:10 kbmmoppdb02 syslogd 1.4.1: restart.
Jan 20 13:23:10 kbmmoppdb02 kernel: klogd 1.4.1, log source = /proc/kmsg started

Regards,
Thiruselvam V
Srinivas Eeda
2014-Jan-21 21:25 UTC
[Ocfs2-users] OCFS2 issues during high availability testing
On 01/20/2014 11:05 PM, Thiruselvam Velayutham wrote:
> Dear Experts,
>
> Our DBA team is facing following problem.
>
> We did high availability testing and when we crash DB node 1, DB Node
> 2 also went down, and from the errors, i could see ocfs2 service has
> shutdown DB02

It appears the filesystem didn't go through a clean umount. In that case, if the network resource goes down before the storage, the other node thinks the network connection between the nodes broke. When that happens, the nodes treat it as a split brain, and in a 2-node cluster the node with the higher node number reboots.

To fix this, make sure you unmount the ocfs2 volumes before doing a reboot.

PS: If the cluster had more than 2 nodes, this wouldn't be an issue, but it is still not a good idea to reboot the box without a clean umount.
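
To make the asymmetry concrete, here is a rough Python sketch of the quorum rule as the log message and the explanation above describe it. It is a simplified model, not the kernel's code: the function name `should_fence` and the two set arguments are hypothetical, and it assumes each node knows which node numbers are still heartbeating on disk and which it can reach over the network (both sets include the local node).

```python
def should_fence(heartbeating, connected):
    """Simplified model of the fencing decision on the local node.

    heartbeating: node numbers still heartbeating on shared disk
                  (assumed to include the local node)
    connected:    node numbers reachable over the interconnect
                  (assumed to include the local node)
    Returns True if the local node should fence (reboot) itself.
    """
    if not heartbeating:
        return False  # nothing is heartbeating, so there is no decision to make

    total = len(heartbeating)
    reachable = len(connected & heartbeating)
    lowest = min(heartbeating)

    if total % 2 == 1:
        # Odd number of nodes: survive only with a strict majority.
        return reachable < (total + 1) // 2

    half = total // 2
    if reachable < half:
        return True
    # Exactly half: only the half that contains the lowest node survives.
    return reachable == half and lowest not in connected


# DB02 (node 1) loses the network to DB01 (node 0) while node 0 is
# still heartbeating on disk: node 1 fences itself.
print(should_fence({0, 1}, {1}))  # True

# The reverse case: node 0 loses node 1 and keeps running.
print(should_fence({0, 1}, {0}))  # False
```

Under this model, a third node changes the outcome because the two surviving nodes still form a majority, which matches the PS above.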