Hi,
We have a 4-node production cluster running an Oracle 10.2.0.2 RAC database
using Oracle Clusterware.
The cluster files are shared on an ocfs2 filesystem. We are using ocfs2
version 1.2.3 and we have O2CB_HEARTBEAT_THRESHOLD set to 60.
Quite frequently, one of the nodes gets a kernel panic due to ocfs2.
We see messages similar to the following in the alert log:
Reconfiguration started (old inc 8, new inc 10)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 3
* domain 0 not valid according to instance 2
Mon Feb 4 15:28:40 2008
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Feb 4 15:28:40 2008
LMS 0: 10 GCS shadows cancelled, 3 closed
*******************************************************************************
/var/log/messages on one of the surviving nodes shows the following:
Feb 4 15:24:32 db0 kernel: o2net: connection to node db1 (num 1) at
10.10.100.51:7777 has been idle for 10 seconds, shutting it down.
Feb 4 15:24:32 db0 kernel: (0,2):o2net_idle_timer:1309 here are some times that
might help debug the situation: (tmr 1202160262.751965 now 1202160272.750632 dr
1202160262.751951
adv 1202160262.751968:1202160262.751970 func (f6ed8616:502)
1202119336.222326:1202119336.222328)
Feb 4 15:24:32 db0 kernel: o2net: no longer connected to node db1 (num 1) at
10.10.100.51:7777
Feb 4 15:28:39 db0 kernel: (13349,2):ocfs2_dlm_eviction_cb:119 device
(120,386): dlm has evicted node 1
Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:847
138B67103BE042A784A6D419278F891D:$RECOVERY: at least one node (1) to recover
before lock mastery can begin
Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:874
138B67103BE042A784A6D419278F891D: recovery map is not empty, but must master
$RECOVERY lock now
*******************************************************************************
cat /proc/version
Linux version 2.6.9-42.0.3.ELsmp (brewbuilder@hs20-bc2-2.build.redhat.com) (gcc
version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:24:31 EDT 2006
*******************************************************************************
cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file
#
# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=60
*******************************************************************************
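For context on that threshold value (this is our understanding from the
ocfs2 FAQ, so please treat the formula as an assumption to be confirmed):
each disk heartbeat iteration is 2 seconds, and a node is fenced after
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds without a heartbeat update.

$ echo $(( (60 - 1) * 2 ))   # our threshold of 60 -> 118 seconds
118

So our disk heartbeat allows almost two minutes, while the network idle
timeout that fired in the log above is only 10 seconds.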
Our system administrator found that the NIC hung right before we lost the
node.
We suspect that by the time the NIC came back up, the cluster had already
declared the node dead and evicted it.
This has been happening frequently, but we are not sure of the root cause.
Would setting the keepalive timeout avoid the instance eviction?
Are there options to set the network idle timeout and keepalive timeout
with ocfs2 1.2.3?
We are considering upgrading ocfs2 to 1.2.5, but would like a temporary
workaround before we deploy it on production. (Our understanding of the
timeout settings in the newer version is sketched below.)
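For reference, our reading of the release notes (an assumption on our
part; the parameter names and default values below are from memory and
should be verified) is that ocfs2 1.2.5 and later make the network
timeouts tunable in /etc/sysconfig/o2cb, along these lines:

# Assumed ocfs2 1.2.5+ tunables -- not available in our 1.2.3 install.
# O2CB_IDLE_TIMEOUT_MS: Idle time in ms before a connection is shut down.
O2CB_IDLE_TIMEOUT_MS=30000
# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent.
O2CB_KEEPALIVE_DELAY_MS=2000
# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts.
O2CB_RECONNECT_DELAY_MS=2000

On 1.2.3, the 10-second idle timeout seen in the log appears to be fixed.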
Please give us your suggestions and help us keep this problem from
recurring.
Thanks,
Sincerely,
Saranya Sivakumar
Database Administrator
Date: Mon, 04 Feb 2008 15:03:56 -0800
From: Sunil Mushran <Sunil.Mushran@oracle.com>
Subject: Re: [Ocfs2-users] ocfs2 kernel panic

The useful info is the oops stack trace. The messages provided are
standard messages not relevant to the problem per se.

Having said that, 1.2.3 is 1.5 years old. Even 1.2.4 is a year old.
My suggestion would be to upgrade. We are about to release 1.2.8 shortly.
Hi,
I will check with my systems administration dept whether they captured the
oops stack trace and post back here. (If it did not land in the logs, we
may try netconsole; see the sketch below.)
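Since syslogd often dies with the node, one way we understand an oops can
be captured is the kernel's netconsole module. This is only a sketch under
assumptions: that the module is built for our 2.6.9 kernel, and the log
host address below is hypothetical.

# On the panicking node: stream kernel messages over UDP to a log host.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
modprobe netconsole netconsole=@/eth0,6666@10.10.100.200/
# On the log host: listen for the messages and keep a copy.
nc -u -l -p 6666 | tee /tmp/db1-oops.log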
We have never done an upgrade of ocfs2 on production. Where can I find
documentation on how to upgrade?

Regards,
Saranya Sivakumar
From: Sunil Mushran <Sunil.Mushran@oracle.com>
Subject: Re: [Ocfs2-users] ocfs2 kernel panic

http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#UPGRADE
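In outline, the procedure behind that link amounts to something like the
following on each node, one node at a time (a sketch only; the package
file names are illustrative, so check the FAQ for the exact steps and the
packages matching your kernel):

# Stop the database instances and CRS on this node first, then:
umount -a -t ocfs2                  # unmount all ocfs2 filesystems
/etc/init.d/o2cb offline ocfs2      # take the cluster (O2CB_BOOTCLUSTER) offline
/etc/init.d/o2cb unload             # unload the old modules
rpm -Uvh ocfs2-tools-1.2.8-1.i386.rpm \
    ocfs2-2.6.9-42.0.3.ELsmp-1.2.8-1.i686.rpm   # illustrative package names
/etc/init.d/o2cb load               # load the new modules
/etc/init.d/o2cb online ocfs2       # bring the cluster back online
mount -a -t ocfs2                   # remount the filesystems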