Hi,

We have a 4-node production cluster running an Oracle 10.2.0.2 RAC database using Oracle Clusterware. The cluster files are shared on an ocfs2 filesystem. We are using ocfs2 version 1.2.3 and have O2CB_HEARTBEAT_THRESHOLD set to 60.

Quite frequently, one of the nodes gets a kernel panic due to ocfs2.

We see messages similar to the following in the alert log:

Reconfiguration started (old inc 8, new inc 10)
List of nodes:
 0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 3
* domain 0 not valid according to instance 2
Mon Feb 4 15:28:40 2008
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Feb 4 15:28:40 2008
LMS 0: 10 GCS shadows cancelled, 3 closed
*******************************************************************************
/var/log/messages on one of the surviving nodes shows the following:

Feb 4 15:24:32 db0 kernel: o2net: connection to node db1 (num 1) at 10.10.100.51:7777 has been idle for 10 seconds, shutting it down.
Feb 4 15:24:32 db0 kernel: (0,2):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1202160262.751965 now 1202160272.750632 dr 1202160262.751951 adv 1202160262.751968:1202160262.751970 func (f6ed8616:502) 1202119336.222326:1202119336.222328)
Feb 4 15:24:32 db0 kernel: o2net: no longer connected to node db1 (num 1) at 10.10.100.51:7777
Feb 4 15:28:39 db0 kernel: (13349,2):ocfs2_dlm_eviction_cb:119 device (120,386): dlm has evicted node 1
Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:847 138B67103BE042A784A6D419278F891D:$RECOVERY: at least one node (1) to recover before lock mastery can begin
Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:874 138B67103BE042A784A6D419278F891D: recovery map is not empty, but must master $RECOVERY lock now
*******************************************************************************
cat /proc/version
Linux version 2.6.9-42.0.3.ELsmp (brewbuilder@hs20-bc2-2.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:24:31 EDT 2006
*******************************************************************************
cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file.
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=60
*******************************************************************************

Our system administrator found that the NIC hung right before we lost the node. We are guessing that by the time the NIC came back up (if it did), the cluster had already declared the node dead and evicted it.

This has been happening frequently, but we are not sure what the root cause is.

Would setting the keepalive timeout avoid the instance eviction? Are there options to set the network idle timeout and keepalive timeout with ocfs2 1.2.3?

We are considering upgrading ocfs2 to 1.2.5, but would like a temporary workaround in place before we deploy the upgrade on production. Please give us your suggestions and help us prevent this problem from recurring.
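For reference, our back-of-envelope reading of the disk heartbeat setting, assuming the (threshold - 1) * 2 seconds formula from the ocfs2 FAQ is the right one:

dead time = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds
          = (60 - 1) * 2
          = 118 seconds

So the disk heartbeat allows almost two minutes, while the o2net idle timeout in the log above fired after only 10 seconds.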
Thanks,
Sincerely,
Saranya Sivakumar
Database Administrator
The useful info is the oops stack trace. The messages provided are standard messages, not relevant to the problem per se.

Having said that, 1.2.3 is 1.5 years old. Even 1.2.4 is a year old. My suggestion would be to upgrade. We are about to release 1.2.8 shortly.
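If the trace scrolls off the console before the node resets, netconsole can ship it to a second box; a serial console is more reliable still, especially since the NIC itself is your suspect here. A rough sketch follows: the interface name, IP addresses, MAC address, ports, and file path are placeholders, not values from your cluster:

# On the node that panics: mirror kernel messages to a log host.
# Syntax: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@[tgt-ip]/[tgt-mac]
modprobe netconsole netconsole=6665@10.10.100.51/eth1,6666@10.10.100.60/00:0A:0B:0C:0D:0E

# On the log host: capture the UDP stream to a file.
nc -u -l -p 6666 | tee /tmp/db1-console.log

Keep in mind that if the oops has to go out over the same NIC that is hanging, netconsole may never get it off the box, which is why a serial console is the safer bet.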
Saranya Sivakumar wrote:
> Would setting the keepalive timeout avoid the instance eviction?
> Are there options to set network idle time out and keepalive timeout
> with ocfs2 1.2.3?
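On that point: in 1.2.3 the o2net idle timeout (the 10 seconds in your log) and the keepalive delay are compiled into the module, so there is nothing to tune without upgrading. From 1.2.5 onwards the network timeouts became configurable via o2cb, and they must be set to the same values on all nodes. As a sketch of what that looks like in /etc/sysconfig/o2cb -- the parameter names are from memory of the 1.2.5+ init script, so verify them against your o2cb configure output, and the values shown are examples, not recommendations:

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=60

# O2CB_IDLE_TIMEOUT_MS: ms an o2net connection may sit idle before it is
# torn down (hard-coded to 10000 in 1.2.3, which is what fired in your log).
O2CB_IDLE_TIMEOUT_MS=30000

# O2CB_KEEPALIVE_DELAY_MS: ms of idle time before a keepalive packet is sent.
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: ms to wait between reconnection attempts.
O2CB_RECONNECT_DELAY_MS=2000

Note that a longer idle timeout only buys time for the NIC to come back; if it stays down past the disk heartbeat threshold, the node will still be fenced.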
Hi,

I will check with my systems administration department to see if they captured the oops stack trace, and will post it back here.

We have never upgraded ocfs2 on production. Where can I find documentation on how to upgrade?

Regards,
Saranya Sivakumar

On Mon, 04 Feb 2008, Sunil Mushran <Sunil.Mushran@oracle.com> wrote:
> The useful info is the oops stack trace. The messages provided are
> standard messages, not relevant to the problem per se.
>
> Having said that, 1.2.3 is 1.5 years old. Even 1.2.4 is a year old.
> My suggestion would be to upgrade. We are about to release 1.2.8 shortly.
Saranya Sivakumar wrote:
> We have never upgraded ocfs2 on production. Where can I find
> documentation on how to upgrade?

http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#UPGRADE
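In outline, it is a node-at-a-time offline upgrade. A rough sketch of the steps -- the mount point and package file names below are placeholders, so use the packages matching your exact kernel, treat the FAQ as the authoritative procedure, and with RAC stop CRS on the node first:

# On each node in turn, with Oracle/CRS stopped on that node:
umount /u02                          # unmount every ocfs2 volume
/etc/init.d/o2cb offline ocfs2       # take the cluster offline on this node
/etc/init.d/o2cb unload              # unload the ocfs2/o2cb modules

# Upgrade the module and tools packages (file names are examples only):
rpm -Uvh ocfs2-2.6.9-42.0.3.ELsmp-1.2.8-1.el4.i686.rpm \
         ocfs2-tools-1.2.7-1.el4.i386.rpm

/etc/init.d/o2cb load                # load the new modules
/etc/init.d/o2cb online ocfs2        # bring the cluster back online
mount /u02                           # remount, then restart services

Repeat on the next node once the upgraded node has rejoined cleanly.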