Hi,
We have a 4-node production cluster running an Oracle 10.2.0.2 RAC database
using Oracle Clusterware.
The cluster files are shared on an ocfs2 filesystem. We are using ocfs2
version 1.2.3 and we have O2CB_HEARTBEAT_THRESHOLD set to 60.
Quite frequently, one of the nodes gets a kernel panic due to ocfs2.
We see messages similar to the following in the alert log:
Reconfiguration started (old inc 8, new inc 10)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 3
* domain 0 not valid according to instance 2
Mon Feb 4 15:28:40 2008
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Feb 4 15:28:40 2008
LMS 0: 10 GCS shadows cancelled, 3 closed
*******************************************************************************
/var/log/messages on one of the surviving nodes shows the following:
Feb 4 15:24:32 db0 kernel: o2net: connection to node db1 (num 1) at
10.10.100.51:7777 has been idle for 10 seconds, shutting it down.
Feb 4 15:24:32 db0 kernel: (0,2):o2net_idle_timer:1309 here are some times that
might help debug the situation: (tmr 1202160262.751965 now 1202160272.750632 dr
1202160262.751951
adv 1202160262.751968:1202160262.751970 func (f6ed8616:502)
1202119336.222326:1202119336.222328)
Feb 4 15:24:32 db0 kernel: o2net: no longer connected to node db1 (num 1) at
10.10.100.51:7777
Feb 4 15:28:39 db0 kernel: (13349,2):ocfs2_dlm_eviction_cb:119 device
(120,386): dlm has evicted node 1
Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:847
138B67103BE042A784A6D419278F891D:$RECOVERY: at least one node (1) to recover
before lock mastery can begin
Feb 4 15:28:41 db0 kernel: (15944,7):dlm_get_lock_resource:874
138B67103BE042A784A6D419278F891D: recovery map is not empty, but must master
$RECOVERY lock now
*******************************************************************************
cat /proc/version
Linux version 2.6.9-42.0.3.ELsmp (brewbuilder@hs20-bc2-2.build.redhat.com) (gcc
version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:24:31 EDT 2006
*******************************************************************************
cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file
#
# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=60
*******************************************************************************
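For context on that threshold value (this is our understanding from the
ocfs2 FAQ, so please treat the formula as an assumption to be confirmed):
each disk heartbeat iteration is 2 seconds, and a node is fenced after
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds without a heartbeat update.

$ echo $(( (60 - 1) * 2 ))   # our threshold of 60 -> 118 seconds
118

So our disk heartbeat allows almost two minutes, while the network idle
timeout that fired in the log above is only 10 seconds.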
Our system administrator found that the NIC hung right before we lost the
node.
We suspect that by the time the NIC came back up, the cluster had already
declared the node dead and evicted it.
This has been happening frequently, but we are not sure of the root cause.
Would setting the keepalive timeout avoid the instance eviction?
Are there options to set the network idle timeout and keepalive timeout
with ocfs2 1.2.3?
We are considering upgrading ocfs2 to 1.2.5, but would like a temporary
workaround before we deploy it on production. (Our understanding of the
timeout settings in the newer version is sketched below.)
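For reference, our reading of the release notes (an assumption on our
part; the parameter names and default values below are from memory and
should be verified) is that ocfs2 1.2.5 and later make the network
timeouts tunable in /etc/sysconfig/o2cb, along these lines:

# Assumed ocfs2 1.2.5+ tunables -- not available in our 1.2.3 install.
# O2CB_IDLE_TIMEOUT_MS: Idle time in ms before a connection is shut down.
O2CB_IDLE_TIMEOUT_MS=30000
# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent.
O2CB_KEEPALIVE_DELAY_MS=2000
# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts.
O2CB_RECONNECT_DELAY_MS=2000

On 1.2.3, the 10-second idle timeout seen in the log appears to be fixed.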
Please give us your suggestions and help us keep this problem from
recurring.
Thanks,
Sincerely,
Saranya Sivakumar
Database Administrator
Date: Mon, 04 Feb 2008 15:03:56 -0800
From: Sunil Mushran <Sunil.Mushran@oracle.com>
Subject: Re: [Ocfs2-users] ocfs2 kernel panic

The useful info is the oops stack trace. The messages provided are
standard messages not relevant to the problem per se.

Having said that, 1.2.3 is 1.5 years old. Even 1.2.4 is a year old.
My suggestion would be to upgrade. We are about to release 1.2.8 shortly.
Hi,
I will check with my systems administration dept whether they captured the
oops stack trace and post back here. (If it did not land in the logs, we
may try netconsole; see the sketch below.)
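Since syslogd often dies with the node, one way we understand an oops can
be captured is the kernel's netconsole module. This is only a sketch under
assumptions: that the module is built for our 2.6.9 kernel, and the log
host address below is hypothetical.

# On the panicking node: stream kernel messages over UDP to a log host.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
modprobe netconsole netconsole=@/eth0,6666@10.10.100.200/
# On the log host: listen for the messages and keep a copy.
nc -u -l -p 6666 | tee /tmp/db1-oops.log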
We have never done an upgrade of ocfs2 on production. Where can I find
documentation on how to upgrade?

Regards,
Saranya Sivakumar
From: Sunil Mushran <Sunil.Mushran@oracle.com>
Subject: Re: [Ocfs2-users] ocfs2 kernel panic

http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#UPGRADE
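In outline, the procedure behind that link amounts to something like the
following on each node, one node at a time (a sketch only; the package
file names are illustrative, so check the FAQ for the exact steps and the
packages matching your kernel):

# Stop the database instances and CRS on this node first, then:
umount -a -t ocfs2                  # unmount all ocfs2 filesystems
/etc/init.d/o2cb offline ocfs2      # take the cluster (O2CB_BOOTCLUSTER) offline
/etc/init.d/o2cb unload             # unload the old modules
rpm -Uvh ocfs2-tools-1.2.8-1.i386.rpm \
    ocfs2-2.6.9-42.0.3.ELsmp-1.2.8-1.i686.rpm   # illustrative package names
/etc/init.d/o2cb load               # load the new modules
/etc/init.d/o2cb online ocfs2       # bring the cluster back online
mount -a -t ocfs2                   # remount the filesystems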