Hi,

I have a 6-node cluster with OCFS2 1.4.2 running on VMware virtual machines with Red Hat 5.2 (2.6.18-128.1.16.el5), 64-bit.

Out of the 6 nodes, two nodes, alf0 and alf3, reboot automatically. I enabled remote logging for the kernel, and the log is below.
I noticed that the VM becomes unresponsive and then suddenly reboots. I am running the Alfresco (document sharing) application, and all nodes access a common share on OCFS2.

---------------------------------------------------------
-Jul 22 09:01:25 172.25.29.10 kernel: o2net: connection to node alf3 (num 3) at 172.25.29.13:7777 has been idle for 30.0 seconds, shutting it down.
-Jul 22 09:01:25 172.25.29.10 kernel: (0,1):o2net_idle_timer:1506 here are some times that might help debug the situation: (tmr 1248267655.660420 now 1248267685.655778 dr 1248267655.660405 adv 1248267655.660422:1248267655.660423 func (0ffa2aed:505) 1248267647.662032:1248267647.662034)
-Jul 22 09:01:25 172.25.29.10 kernel: o2net: no longer connected to node alf3 (num 3) at 172.25.29.13:7777
-Jul 22 09:01:25 172.25.29.15 kernel: o2net: connection to node alf3 (num 3) at 172.25.29.13:7777 has been idle for 30.0 seconds, shutting it down.
-Jul 22 09:01:25 172.25.29.15 kernel: (0,0):o2net_idle_timer:1506 here are some times that might help debug the situation: (tmr 1248267655.816401 now 1248267685.812715 dr 1248267655.816401 adv 1248267655.816401:1248267655.816401 func (0ffa2aed:502) 1248267507.842160:1248267507.842160)
-Jul 22 09:01:25 172.25.29.15 kernel: o2net: no longer connected to node alf3 (num 3) at 172.25.29.13:7777
-Jul 22 09:01:55 172.25.29.10 kernel: (2733,1):o2net_connect_expired:1667 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
-Jul 22 09:01:55 172.25.29.15 kernel: (2541,0):o2net_connect_expired:1667 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.

How can I tell which node holds quorum? And can I move it to a less busy node?
Thanks,
Raheel
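The two o2net_idle_timer lines in the log above carry raw timestamps, and the elapsed times are easier to see once they are subtracted. The following is only a rough sketch of that arithmetic. The helper names (to_usec, idle_gaps) are made up for illustration, and reading "tmr" as the time the idle timer was last armed, "dr" as the last data-ready time, and "func" as the start/stop times of the last message handler is an assumption based on the field names, not something stated in this thread. Seconds and microseconds are parsed as separate integers so the result does not depend on how the fractional part was padded.

```python
import re

# The two o2net_idle_timer lines from the post, rejoined after line wrapping.
LINES = [
    "(0,1):o2net_idle_timer:1506 here are some times that might help debug "
    "the situation: (tmr 1248267655.660420 now 1248267685.655778 "
    "dr 1248267655.660405 adv 1248267655.660422:1248267655.660423 "
    "func (0ffa2aed:505) 1248267647.662032:1248267647.662034)",
    "(0,0):o2net_idle_timer:1506 here are some times that might help debug "
    "the situation: (tmr 1248267655.816401 now 1248267685.812715 "
    "dr 1248267655.816401 adv 1248267655.816401:1248267655.816401 "
    "func (0ffa2aed:502) 1248267507.842160:1248267507.842160)",
]

STAMP = r"\d+\.\d+"
PATTERN = re.compile(
    rf"tmr (?P<tmr>{STAMP}) now (?P<now>{STAMP}) dr (?P<dr>{STAMP}) "
    rf"adv (?P<adv_start>{STAMP}):(?P<adv_stop>{STAMP}) "
    rf"func \((?P<key>[0-9a-f]+):(?P<type>\d+)\) "
    rf"(?P<func_start>{STAMP}):(?P<func_stop>{STAMP})"
)


def to_usec(stamp):
    """Convert a 'seconds.microseconds' string to integer microseconds."""
    sec, usec = stamp.split(".")
    return int(sec) * 1_000_000 + int(usec)


def idle_gaps(line):
    """Return elapsed seconds between 'now' and the other timestamps, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    now = to_usec(m["now"])
    return {
        "since_tmr": (now - to_usec(m["tmr"])) / 1e6,
        "since_dr": (now - to_usec(m["dr"])) / 1e6,
        "since_func_stop": (now - to_usec(m["func_stop"])) / 1e6,
        "msg_type": int(m["type"]),
    }


if __name__ == "__main__":
    for line in LINES:
        print(idle_gaps(line))
```

For both lines the gap between "tmr" and "now" comes out just under 30 seconds, which agrees with the "idle for 30.0 seconds" message printed by the same nodes.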
Please file a bugzilla and attach the netconsole logs of all six nodes. The messages provided indicate only that that node saw the two nodes become unresponsive. Why they became unresponsive will be known only after we see the netconsole logs of those two nodes.
Hi,

1. I tried to set up netconsole, but I am getting no logs on the logging system.
2. How can I check which node holds quorum, and how can I move it to a different node?

Thanks

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
Sent: Wednesday, July 22, 2009 2:45 PM
To: Raheel Akhtar
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 Node restart

Please file a bugzilla and attach the netconsole logs of all six nodes. The messages provided indicate that that node saw the two nodes become unresponsive. As to why they became unresponsive will be known only after we see the netconsole logs of the two nodes.
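On the netconsole problem mentioned in the last message: netconsole sends kernel messages as plain UDP datagrams, so one way to narrow things down is to check whether any packets reach the logging host at all. The sketch below is a bare UDP listener for that purpose, not a replacement for a syslog daemon. The function name receive_datagrams is made up for illustration, and port 6666 is assumed because it is the documented netconsole default target port; the setup in this thread may use a different one.

```python
import socket


def receive_datagrams(port=6666, host="0.0.0.0", count=1, timeout=10.0):
    """Collect up to `count` UDP payloads as (sender_ip, text) pairs.

    Returns whatever arrived before `timeout` seconds passed without a packet.
    """
    received = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        sock.settimeout(timeout)
        try:
            while len(received) < count:
                data, (sender_ip, _sender_port) = sock.recvfrom(65535)
                text = data.decode("utf-8", errors="replace")
                received.append((sender_ip, text))
        except socket.timeout:
            # Nothing (more) arrived in time; return what we have.
            pass
    return received


if __name__ == "__main__":
    for sender, text in receive_datagrams(count=5, timeout=30.0):
        print(sender, text.rstrip())
```

If nothing is printed while a cluster node is known to be logging, the packets are probably being dropped before they reach the listener (wrong target address or port, a firewall, or another process already bound to the port), and that is worth checking before looking at OCFS2 itself.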