thr3ads.net - Ocfs2 users - [Ocfs2-users] frequent production node reboots [Oct 2008]

Saranya Sivakumar

2008-Oct-27 15:47 UTC

[Ocfs2-users] frequent production node reboots

Hi All,

We have been having frequent node reboots in our 4-node production RAC cluster.
We are using 10.2.0.2 Clusterware + 10.2.0.2 RAC Database (using ASM for database
files).
We are using OCFS2 for the cluster files.

cat /proc/fs/ocfs2/version
OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 (build
9c7ae8bb50ef6d8791df2912775adcc5)

/etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
  Heartbeat dead threshold: 61
  Network idle timeout: 60000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Active
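
For readers comparing the two timeouts in the status output above, here is a minimal Python sketch that converts them to seconds. It assumes the commonly documented O2CB relation that the disk heartbeat timeout is (dead threshold - 1) * 2 seconds, and that the network idle timeout is reported in milliseconds; the function names are made up for illustration and are not part of any OCFS2 tool.

```python
# Values copied from the "/etc/init.d/o2cb status" output above.
STATUS = {
    "Heartbeat dead threshold": 61,
    "Network idle timeout": 60000,     # milliseconds (assumption)
    "Network keepalive delay": 2000,   # milliseconds (assumption)
    "Network reconnect delay": 2000,   # milliseconds (assumption)
}


def disk_heartbeat_timeout_s(dead_threshold, interval_s=2):
    """Assumed relation: timeout = (threshold - 1) * heartbeat interval."""
    return (dead_threshold - 1) * interval_s


def ms_to_s(milliseconds):
    """Convert a millisecond setting to seconds."""
    return milliseconds / 1000.0


disk_timeout = disk_heartbeat_timeout_s(STATUS["Heartbeat dead threshold"])
net_idle_timeout = ms_to_s(STATUS["Network idle timeout"])

print("disk heartbeat timeout: %d s" % disk_timeout)
print("network idle timeout:   %.1f s" % net_idle_timeout)
```

Under these assumptions the network idle timeout (60 s) is the shorter of the two, which is consistent with the o2net idle messages appearing in the log before the reboot.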

Most recent reboot happened this morning around 9:20 am. 
/var/log/messages on the node that rebooted (db3)
Oct 27 09:23:56 db3 kernel: o2net: connection to node db2 (num 2) at
10.10.100.52:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times
that might help debug the situation: (tmr 1225117376.63515 now
1225117436.56120 dr 1225117378.63161 adv 1225117376.63652:1225117376.63654 func
(f6ed8616:500) 1225117376.63517:1225117376.63644)
Oct 27 09:23:56 db3 kernel: o2net: connection to node db0 (num 0) at
10.10.100.50:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times
that might help debug the situation: (tmr 1225117376.279971 now
1225117436.272115 dr 1225117436.272042 adv 1225117376.279977:1225117376.279978
func (f6ed8616:504) 1225117356.455405:1225117356.455412)
Oct 27 09:23:56 db3 kernel: o2net: connection to node db1 (num 1) at
10.10.100.51:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times
that might help debug the situation: (tmr 1225117376.281033 now
1225117436.274121 dr 1225117436.273839 adv 1225117376.281030:1225117376.280783
func (f6ed8616:502) 1225117376.281035:1225117376.280774)
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_send_tcp_msg:841 ERROR: sendmsg
returned -32 instead of 24
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db1 (num 1) at
10.10.100.51:7777
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db2 (num 2) at
10.10.100.52:7777
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size
24 to node db0 (num 0) at 10.10.100.50:7777 failed with -32
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db0 (num 0) at
10.10.100.50:7777
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size
24 to node db0 (num 0) at 10.10.100.50:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size
24 to node db1 (num 1) at 10.10.100.51:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size
24 to node db1 (num 1) at 10.10.100.51:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to
0 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status =
-112
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to
1 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status =
-107
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to
2 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status =
-107
Oct 27 09:28:27 db3 syslogd 1.4.1: restart.
Oct 27 09:28:27 db3 syslog: syslogd startup succeeded
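
The negative numbers in the log excerpt above (-32, -107, -112) are negated Linux errno values. A minimal Python sketch for decoding them follows; the small lookup table is hard-coded from the standard Linux errno numbering, and the regular expression and function name are illustrative assumptions, not part of o2net or the DLM.

```python
import re

# Standard Linux errno numbers for the codes seen in the log.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}

# Message fragments copied from the /var/log/messages excerpt.
LOG_LINES = [
    "o2net_send_tcp_msg:841 ERROR: sendmsg returned -32 instead of 24",
    "o2net_sendpage:875 ERROR: sendpage of size 24 to node db0 (num 0) "
    "at 10.10.100.50:7777 failed with -32",
    "dlm_get_lock_resource:995 ERROR: status = -112",
    "dlm_get_lock_resource:995 ERROR: status = -107",
]

# Hypothetical pattern covering the three phrasings used in these messages.
PATTERN = re.compile(r"(?:returned|failed with|status =) -(\d+)")


def decode(line):
    """Return (code, name, description) for the errno in a log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
    return code, name, text


for line in LOG_LINES:
    print(decode(line))
```

All three codes describe a peer connection that is already gone, so they read as consequences of the idle-timeout shutdown a few seconds earlier rather than as independent failures.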
From the logs, it looks like the node that reboots is not able to
communicate with the other nodes for more than 60 seconds and thus reboots
itself.
Initially the idle timeout was set to 30 seconds, but since we are using a bonded
interface, we increased it to 60 seconds, and the nodes are still rebooting.
We also upgraded the NIC drivers to the latest version to make sure that the
driver is not causing such issues.
Now we are not sure what else could be causing the frequent reboots. Please help
us to debug the situation. Please let me know if more information is needed.
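
The 60-second gap described above can be checked against the o2net_idle_timer debug line itself. The Python sketch below takes the "tmr" and "now" fields from the db0 message and computes the interval between them. It assumes each value is epoch seconds followed by a microsecond count (which may be printed without leading zeros); the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Fields copied from the db0 idle-timer message in the log.
LINE = ("(tmr 1225117376.279971 now 1225117436.272115 "
        "dr 1225117436.272042 adv 1225117376.279977:1225117376.279978")


def field(line, name):
    """Return (seconds, microseconds) for a named timestamp field."""
    match = re.search(r"\b%s (\d+)\.(\d+)" % re.escape(name), line)
    if match is None:
        raise ValueError("field %r not found" % name)
    return int(match.group(1)), int(match.group(2))


def seconds_between(start, end):
    """Elapsed seconds between two (seconds, microseconds) pairs."""
    return (end[0] - start[0]) + (end[1] - start[1]) / 1000000.0


tmr = field(LINE, "tmr")
now = field(LINE, "now")
idle = seconds_between(tmr, now)
when_utc = datetime.fromtimestamp(now[0], tz=timezone.utc)

print("idle for %.6f s" % idle)
print("timer fired at %s UTC" % when_utc.strftime("%Y-%m-%d %H:%M:%S"))
```

The interval comes out just under 60 seconds, matching the configured network idle timeout, and the "now" value corresponds to 14:23:56 UTC, which lines up with the 09:23:56 local timestamp in syslog if the host clock is five hours behind UTC.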


 Regards,


Saranya Sivakumar



      

Sunil Mushran

2008-Oct-27 16:50 UTC


Do upgrade to ocfs2 1.2.9-1. It has a fix for oss bugzilla#919 that could
be causing the timeouts. The symptom for that issue is o2net spinning at
100% shortly before the timeout/fence.
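
As a rough aid for checking each node against the suggested release, the Python sketch below parses a /proc/fs/ocfs2/version string and compares it with 1.2.9. The sample string is the one posted in this thread; the function name and the assumption that the string always starts with "OCFS2 x.y.z" are illustrative only.

```python
import re

# Version string as reported in this thread.
REPORTED = ("OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 "
            "(build 9c7ae8bb50ef6d8791df2912775adcc5)")

# Release said in this thread to carry the fix for oss bugzilla#919.
FIXED_IN = (1, 2, 9)


def parse_version(text):
    """Extract (major, minor, patch) from an 'OCFS2 x.y.z ...' string."""
    match = re.match(r"OCFS2 (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("unrecognized version string: %r" % text)
    return tuple(int(part) for part in match.groups())


installed = parse_version(REPORTED)
needs_upgrade = installed < FIXED_IN

print("installed:", ".".join(str(n) for n in installed))
print("needs upgrade:", needs_upgrade)
```

On a live node the string could be read from /proc/fs/ocfs2/version instead of being hard-coded.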


