Hi All,

We have been having frequent node reboots in our 4 node production RAC cluster.
We are using 10.2.0.2 Clusterware + 10.2.0.2 RAC Database (using ASM for database files).
We are using OCFS2 for the cluster files.

cat /proc/fs/ocfs2/version
OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 (build 9c7ae8bb50ef6d8791df2912775adcc5)

/etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold: 61
Network idle timeout: 60000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active

Most recent reboot happened this morning around 9:20 am.
/var/log/messages on the node that rebooted (db3)

Oct 27 09:23:56 db3 kernel: o2net: connection to node db2 (num 2) at 10.10.100.52:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.63515 now 1225117436.56120 dr 1225117378.63161 adv 1225117376.63652:1225117376.63654 func (f6ed8616:500) 1225117376.63517:1225117376.63644)
Oct 27 09:23:56 db3 kernel: o2net: connection to node db0 (num 0) at 10.10.100.50:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.279971 now 1225117436.272115 dr 1225117436.272042 adv 1225117376.279977:1225117376.279978 func (f6ed8616:504) 1225117356.455405:1225117356.455412)
Oct 27 09:23:56 db3 kernel: o2net: connection to node db1 (num 1) at 10.10.100.51:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.281033 now 1225117436.274121 dr 1225117436.273839 adv 1225117376.281030:1225117376.280783 func (f6ed8616:502) 1225117376.281035:1225117376.280774)
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_send_tcp_msg:841 ERROR: sendmsg returned -32 instead of 24
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db1 (num 1) at 10.10.100.51:7777
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db2 (num 2) at 10.10.100.52:7777
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed with -32
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db0 (num 0) at 10.10.100.50:7777
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 0 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -112
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 1 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -107
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 2 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -107
Oct 27 09:28:27 db3 syslogd 1.4.1: restart.
Oct 27 09:28:27 db3 syslog: syslogd startup succeeded

From the logs, it looks like the node that reboots is unable to communicate with the other nodes for more than 60 seconds and therefore reboots itself.
Initially the network idle timeout was set to 30 seconds, but since we are using a bonded interface, we increased it to 60 seconds, and the nodes are still rebooting.
We also upgraded the NIC drivers to the latest version to make sure that the driver is not causing the problem.
Now we are not sure what else could be causing the frequent reboots.
Please help us debug the situation, and let me know if more information is needed.

Regards,
Saranya Sivakumar
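
For anyone reading the o2net_idle_timer lines above, a small Python sketch like the following can pull out the "tmr" and "now" stamps and show how long the connection sat idle before it was shut down. This is only an illustration, not an OCFS2 tool. It assumes that "tmr" is the time the idle timer was last armed, that "now" is the time it fired, and that the part after the dot is an unpadded microsecond count (so 1225117376.63515 would mean 63515 microseconds). The function names are made up for this example.

```python
import re
from datetime import datetime, timezone

# Matches the "tmr <sec>.<usec> now <sec>.<usec>" part of an
# o2net_idle_timer debug line.
TIMER_RE = re.compile(r"tmr (\d+)\.(\d+) now (\d+)\.(\d+)")


def to_seconds(sec: str, usec: str) -> float:
    # Assumption: the fraction is an unpadded microsecond count,
    # so "63515" means 0.063515 s, not 0.63515 s.
    return int(sec) + int(usec) / 1_000_000


def idle_gap(line: str) -> float:
    """Seconds between the 'tmr' stamp and the 'now' stamp."""
    m = TIMER_RE.search(line)
    if m is None:
        raise ValueError("no tmr/now pair found in line")
    tmr = to_seconds(m.group(1), m.group(2))
    now = to_seconds(m.group(3), m.group(4))
    return now - tmr


def armed_at_utc(line: str) -> datetime:
    """The 'tmr' stamp as a UTC datetime (whole seconds only)."""
    m = TIMER_RE.search(line)
    if m is None:
        raise ValueError("no tmr/now pair found in line")
    return datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)


if __name__ == "__main__":
    sample = (
        "(tmr 1225117376.279971 now 1225117436.272115 "
        "dr 1225117436.272042 adv 1225117376.279977:1225117376.279978)"
    )
    print("idle for %.3f s" % idle_gap(sample))
    print("timer armed at", armed_at_utc(sample).isoformat())
```

Under those assumptions, a gap close to the configured network idle timeout (60000 ms here) on all three connections at the same moment points at the local node rather than at any single peer.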
Do upgrade to ocfs2 1.2.9-1. It has a fix for oss bugzilla#919 that could be causing the timeouts. The symptom for that issue is o2net spinning at 100% shortly before the timeout/fence.
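
To check several nodes quickly against the release mentioned above, a rough Python sketch along these lines could compare the string from /proc/fs/ocfs2/version with 1.2.9. It assumes the version file always starts with "OCFS2 <dotted version>", as in the output posted in this thread. The helper names and the FIXED_IN constant are invented for the example.

```python
import re
from typing import Tuple

# Release named in this thread as carrying the fix for oss bugzilla#919.
FIXED_IN = (1, 2, 9)


def parse_ocfs2_version(text: str) -> Tuple[int, ...]:
    """Extract the dotted version from text such as 'OCFS2 1.2.8 Tue Jan 22 ...'."""
    m = re.search(r"OCFS2\s+(\d+(?:\.\d+)+)", text)
    if m is None:
        raise ValueError("no OCFS2 version found")
    return tuple(int(part) for part in m.group(1).split("."))


def needs_upgrade(text: str, fixed_in: Tuple[int, ...] = FIXED_IN) -> bool:
    """True if the running version is older than the release with the fix."""
    return parse_ocfs2_version(text) < fixed_in


if __name__ == "__main__":
    # Path taken from the original post; only present where ocfs2 is loaded.
    with open("/proc/fs/ocfs2/version") as fh:
        contents = fh.read()
    print(parse_ocfs2_version(contents), "upgrade needed:", needs_upgrade(contents))
```

Comparing tuples of integers rather than raw strings keeps a version such as 1.2.10 from sorting before 1.2.9.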