Hi all, I have started experiencing node reboots in our 16 node cluster. This is Red Hat 4 update 8 (kernel 2.6.89.0.26.ELsmp). cat /proc/fs/ocfs2/version OCFS2 1.2.9 Mon Jun 21 20:03:07 PDT 2010 (build 5e8325ec7f66b5189c65c7a8710fe8cb) /etc/init.d/o2cb status Module "configfs": Loaded Filesystem "configfs": Mounted Module "ocfs2_nodemanager": Loaded Module "ocfs2_dlm": Loaded Module "ocfs2_dlmfs": Loaded Filesystem "ocfs2_dlmfs": Mounted Checking O2CB cluster UATocfs2: Online Heartbeat dead threshold: 61 Network idle timeout: 30000 Network keepalive delay: 2000 Network reconnect delay: 2000 Checking O2CB heartbeat: Active Most recent reboot happened early yesterday morning, and resulted in 5 nodes rebooting over the space of just over an hour. /var/log/messages from the first node to reboot shows the following Nov 21 02:05:09 dcapp08 kernel: o2net: connection to node dcapp02 (num 1) at 10.255.255.2:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:09 dcapp08 kernel: (0,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.449900 now 1290265509.449647 dr 1290265479.426428 adv 1290265479.44 9905:1290265479.449905 func (d399da91:500) 1290265475.684306:1290265475.700022) Nov 21 02:05:09 dcapp08 kernel: o2net: no longer connected to node dcapp02 (num 1) at 10.255.255.2:10007 Nov 21 02:05:09 dcapp08 kernel: (21573,1):dlm_do_master_request:1418 ERROR: link to 1 went down! Nov 21 02:05:09 dcapp08 kernel: (21573,1):dlm_get_lock_resource:995 ERROR: status = -112 Nov 21 02:05:16 dcapp08 kernel: o2net: connected to node dc-c02b02-app02 (num 1) at 10.255.255.2:10007 Nov 21 02:03:02 dcapp08 syslogd 1.4.1: restart. On one of the other nodes, there is additional information in messages logfile Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb03 (num 10) at 10.255.255.11:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265477.576542 now 1290265507.576428 dr 1290265479.576562 adv 129026547 7.576547:1290265477.576548 func (d399da91:502) 1290260960.121797:1290260960.121799) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp03 (num 2) at 10.255.255.3:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265477.606847 now 1290265507.607428 dr 1290265507.607353 adv 129026547 7.606850:1290265477.606850 func (d399da91:504) 1290265452.506254:1290265452.506310) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp04 (num 3) at 10.255.255.4:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265477.987792 now 1290265507.988438 dr 1290265498.890055 adv 129026547 7.987792:1290265477.987793 func (d399da91:500) 1290265475.987731:1290265475.987739) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp01 (num 0) at 10.255.255.1:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265478.881795 now 1290265508.881459 dr 1290265480.889380 adv 129026547 8.889657:1290265478.889659 func (d399da91:500) 1290265478.881795:1290265478.889640) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp08 (num 7) at 10.255.255.8:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.425622 now 1290265509.425473 dr 1290265481.449378 adv 129026547 9.425678:1290265479.425679 func (d399da91:500) 1290265479.425623:1290265479.425673) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb02 (num 9) at 10.255.255.10:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.439787 now 1290265509.439473 dr 1290265481.439720 adv 129026547 9.439838:1290265479.439839 func (d399da91:500) 1290265479.439788:1290265479.439834) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp07 (num 6) at 10.255.255.7:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.469144 now 1290265509.469475 dr 1290265509.450361 adv 129026547 9.473102:1290265479.473104 func (d399da91:502) 1290265479.469146:1290265479.473087) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcb07 (num 14) at 10.255.255.15:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.473112 now 1290265509.473475 dr 1290265485.575784 adv 129026547 9.480961:1290265479.480963 func (d399da91:500) 1290265479.473113:1290265479.480947) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb04 (num 11) at 10.255.255.12:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.480970 now 1290265509.481477 dr 1290265481.481060 adv 129026547 9.481016:1290265479.481016 func (d399da91:500) 1290265479.480970:1290265479.481012) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb08 (num 15) at 10.255.255.16:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.485633 now 1290265509.485475 dr 1290265481.492977 adv 129026547 9.493484:1290265479.493487 func (d399da91:500) 1290265479.485633:1290265479.493467) Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb01 (num 8) at 10.255.255.9:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.493493 now 1290265509.493475 dr 1290265509.484478 adv 129026547 9.493488:1290265479.484610 func (d399da91:502) 1290265479.493495:1290265479.484606) Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_send_tcp_msg:840 ERROR: sendmsg returned -32 instead of 24 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb01 (num 8) at 10.255.255.9:10007 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_send_tcp_msg:840 ERROR: sendmsg returned -32 instead of 24 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp07 (num 6) at 10.255.255.7:10007 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcdb01 (num 8) at 10.255.255.9:10007 failed with -32 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp08 (num 7) at 10.255.255.8:10007 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb03 (num 10) at 10.255.255.11:10007 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp03 (num 2) at 10.255.255.3:10007 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_send_tcp_msg:840 ERROR: sendmsg returned -32 instead of 24 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp04 (num 3) at 10.255.255.4:10007 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp04 (num 3) at 10.255.255.4:10007 failed with -32 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp01 (num 0) at 10.255.255.1:10007 failed with -32 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp01 (num 0) at 10.255.255.1:10007 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp01 (num 0) at 10.255.255.1:10007 failed with -32 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp08 (num 7) at 10.255.255.8:10007 failed with -32 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb02 (num 9) at 10.255.255.10:10007 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp07 (num 6) at 10.255.255.7:10007 failed with -32 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcb07 (num 14) at 10.255.255.15:10007 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb04 (num 11) at 10.255.255.12:10007 Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb08 (num 15) at 10.255.255.16:10007 Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcdb01 (num 8) at 10.255.255.9:10007 failed with -32 Nov 21 02:05:16 dcapp02 kernel: (32175,2):dlm_send_remote_convert_request:398 ERROR: status = -107 Nov 21 02:05:16 dcapp02 kernel: (32175,2):dlm_wait_for_node_death:365 C5C06C9B675D41B99B60DE2EB28CE0F7: waiting 5000ms for notification of death of node 3 Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 0 went down! Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -112 Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 2 went down! Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107 Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 3 went down! Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107 Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 6 went down! Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107 Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 7 went down! Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107 Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 8 went down! Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107 Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 9 went down! Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb03 (num 10) at 10.255.255.11:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcapp03 (num 2) at 10.255.255.3:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcapp04 (num 3) at 10.255.255.4:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb02 (num 9) at 10.255.255.10:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcapp08 (num 7) at 10.255.255.8:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcapp07 (num 6) at 10.255.255.7:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcb07 (num 14) at 10.255.255.15:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb04 (num 11) at 10.255.255.12:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb01 (num 8) at 10.255.255.9:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb08 (num 15) at 10.255.255.16:10007 Nov 21 02:05:17 dcapp02 kernel: o2net: connected to node dcapp01 (num 0) at 10.255.255.1:10007 Nov 21 02:58:30 dcapp02 kernel: o2net: connection to node dcapp08 (num 7) at 10.255.255.8:10007 has been idle for 30.0 seconds, shutting it down. Nov 21 02:58:30 dcapp02 kernel: (0,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290268680.817422 now 1290268710.817946 dr 1290268680.817412 adv 1290268680.81 7427:1290268680.817428 func (d399da91:500) 1290268678.817606:1290268678.817620) Nov 21 02:58:30 dcapp02 kernel: o2net: no longer connected to node dcapp08 (num 7) at 10.255.255.8:10007 Nov 21 02:59:00 dcapp02 kernel: (31112,2):o2net_connect_expired:1585 ERROR: no connection established with node 7 after 30.0 seconds, giving up and returning errors. Nov 21 03:00:00 dcapp02 last message repeated 2 times Nov 21 03:00:06 dcapp02 kernel: (31112,2):ocfs2_dlm_eviction_cb:119 device (120,1): dlm has evicted node 7 Nov 21 03:00:11 dcapp02 kernel: (15445,0):dlm_restart_lock_mastery:1301 ERROR: node down! 7 Nov 21 03:00:11 dcapp02 kernel: (15445,0):dlm_wait_for_lock_mastery:1118 ERROR: status = -11 Nov 21 03:00:12 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin Nov 21 03:00:45 dcapp02 last message repeated 3 times Nov 21 03:03:29 dcapp02 kernel: o2net: accepted connection from node dcapp08 (num 7) at 10.255.255.8:10007 Nov 21 03:03:30 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin Nov 21 03:04:03 dcapp02 last message repeated 3 times Nov 21 03:04:58 dcapp02 last message repeated 5 times Nov 21 03:05:05 dcapp02 kernel: o2net: no longer connected to node dcapp08 (num 7) at 10.255.255.8:10007 Nov 21 03:05:09 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin Nov 21 03:05:20 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin Nov 21 03:05:21 dcapp02 kernel: o2net: accepted connection from node dcapp08 (num 7) at 10.255.255.8:10007 Nov 21 03:05:31 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin Nov 21 03:06:04 dcapp02 last message repeated 3 times Nov 21 03:06:48 dcapp02 last message repeated 4 times Nov 21 03:06:55 dcapp02 kernel: o2net: no longer connected to node dcapp08 (num 7) at 10.255.255.8:10007 Nov 21 03:06:59 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin Nov 21 03:07:32 dcapp02 last message repeated 3 times I notice that the Idle Timeout is set to 30 seconds, but as we are using bonded interface perhaps I should increase to 60 seconds? Looking at sar report, I couldn't see anything strange about the network over the reboot period. I wonder if anyone has any suggestions on troubleshooting this, as this is the second time in 3 days I have had this issue. Many thanks Neil