Kristiansen Morten
2013-Mar-21 13:46 UTC
[Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.
Hi,

We are running an 8-node cluster on RHEL (kernel 2.6.18-128, 64-bit). Yesterday the server/SAN team moved the ocfs2 disks to another SAN by mirroring and synchronizing the disks. When they rebooted the servers, one of the nodes, tos-dipsprod-07, wasn't able to start Oracle Grid Infrastructure: the voting disk was not found. We then tried to reboot that node, which caused all nodes to reboot. This happened at around 02:25.

When examining /var/log/messages, I discovered a BUG message on one of the nodes that rebooted unexpectedly, tos-dipsprod-02. I've tried to google it, but I couldn't find any solution. Is this a well-known bug? Does anybody have a solution to this problem?

Below is an extract of the o2net and ocfs2 messages from the /var/log/messages files.

/var/log/messages for tos-dipsprod-07:

Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:35 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:40 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:45 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:54 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-04 (num 5) at 192.168.7.103:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 04:03:17 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:32 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:37 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:47 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102:7777 has been idle for 10.0 seconds, shutting it down.
Mar 21 06:04:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101:7777 has been idle for 10.0 seconds, shutting it down.

And here from tos-dipsprod-02:

10474-Mar 21 02:25:15 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 7, node 3 changing it to 7
10646-Mar 21 02:25:25 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7
10826:Mar 21 02:25:25 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
10939-Mar 21 02:43:01 tos-dipsprod-02 syslogd 1.4.1: restart.
10995-Mar 21 02:43:02 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
--
17537-Mar 21 04:06:19 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 6, node 6 changing it to 7
17709-Mar 21 04:06:29 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 255 is supposed to be the new master, dead=7
17891:Mar 21 04:06:29 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
18004-Mar 21 04:38:04 tos-dipsprod-02 syslogd 1.4.1: restart.
18060-Mar 21 04:41:33 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
Morten Kristiansen | Counsellor
Helse Nord IKT | Departement of Serviceproduction
Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
Office address: Amtmann Worsøes gate 63, 8012 Bodø, Norway
Quality - Safety - Respect
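
For anyone triaging a similar incident, the log excerpt in the message above can be summarized mechanically. The following is a minimal sketch, not part of the original post, assuming Python 3 and assuming the o2net and dlm message formats are exactly as quoted in the excerpt. The names `IDLE_RE`, `FINALIZE_RE`, `summarize` and `SAMPLE` are hypothetical, chosen for illustration. It counts idle-timeout disconnects per peer node and collects the sender/expected-master pairs reported by dlm_finalize_reco_handler.

```python
import re
from collections import Counter

# Assumed format: the o2net idle-timeout message as quoted in the post.
IDLE_RE = re.compile(
    r"o2net: connection to node (?P<peer>\S+) \(num (?P<num>\d+)\) "
    r"at \S+ has been idle for [\d.]+ seconds"
)

# Assumed format: the recovery-finalize error logged just before the kernel BUG.
FINALIZE_RE = re.compile(
    r"dlm_finalize_reco_handler:\d+ ERROR: node (?P<sender>\d+) sent recovery "
    r"finalize msg, but node (?P<expected>\d+) is supposed to be the new master, "
    r"dead=(?P<dead>\d+)"
)


def summarize(lines):
    """Return (idle disconnect count per peer, list of (sender, expected, dead))."""
    idle_by_peer = Counter()
    mismatches = []
    for line in lines:
        idle = IDLE_RE.search(line)
        if idle:
            idle_by_peer[idle.group("peer")] += 1
            continue
        fin = FINALIZE_RE.search(line)
        if fin:
            mismatches.append(
                (int(fin.group("sender")), int(fin.group("expected")), int(fin.group("dead")))
            )
    return idle_by_peer, mismatches


# A few lines copied from the excerpt in the post, used as sample input.
SAMPLE = [
    "Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.",
    "Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100:7777 has been idle for 10.0 seconds, shutting it down.",
    "Mar 21 02:25:45 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105:7777 has been idle for 10.0 seconds, shutting it down.",
    "10646-Mar 21 02:25:25 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7",
    "17709-Mar 21 04:06:29 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 255 is supposed to be the new master, dead=7",
]

if __name__ == "__main__":
    idle, mismatches = summarize(SAMPLE)
    for peer, count in sorted(idle.items()):
        print(f"{peer}: {count} idle disconnect(s)")
    for sender, expected, dead in mismatches:
        print(f"finalize from node {sender}, expected master {expected}, dead node {dead}")
```

To run this against a real log rather than the sample, pass an open file object to `summarize`, since it only iterates over lines.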
Kristiansen Morten
2013-Apr-11 10:10 UTC
[Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.
I've had no response to my problem. Is there anybody who can help me with this?

Morten K.
Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
Quality - Safety - Respect