thr3ads.net - Ocfs2 users - [Ocfs2-users] CRS/CSS and OCFS2 [May 2008]

alexandra.strauss at bayerbbs.com

2008-May-27 13:41 UTC

[Ocfs2-users] CRS/CSS and OCFS2

Hello,

I am writing in the hope that you can help me with my problem. We have an
issue here and opened an SR at Metalink, but so far we have received no
useful information for solving it. The SR number is 6855815.994...

We want to protect 9i single-instance databases with 10g Clusterware,
following the third-party-tool approach. There are no RAC databases
involved, but we want to achieve high availability because the databases
are business-critical systems. The systems should be able to relocate to
another machine in case of failure, to keep downtime low. To achieve this
we want to use OCFS2 for the filesystem. Relocation is done by a script
with the help of CRS.

So we took two systems (byaz05 and byaz10) and installed the following
software: 10g CRS (10.2.0.3) and Oracle Software 9.2.0.8 and OCFS2 1.2.8

We found the following Metalink notes and adjusted the heartbeat and
timeouts for OCFS2:
Metalink Note 395878.1: Heartbeat/Voting/Quorum Related Timeout
Configuration for Linux, OCFS2, RAC Stack to avoid unnecessary node
fencing, panic and reboot
Metalink Note 391771.1: OCFS2 - FREQUENTLY ASKED QUESTIONS (here in
particular the section on fencing and quorum)
Metalink Note 434255.1: Common reasons for OCFS2 Kernel Panic or Reboot
Issues
Metalink Note 457423.1: OCFS2 Fencing, Network, and Disk Heartbeat Timeout
Configuration

We have not changed the CRS/CSS default settings so far.

During HA testing we observed unexpected behaviour of the system. We
deactivated the bond for the private interconnect and expected only one
node to go down, but both nodes went down. It seems to me that one node
was rebooted by OCFS2 and the other one by CRS/CSS.

Timestamp
--------------------------------------------------------------------------------------------------------------
10:21:06 bond1 disabled (eth1)
/var/log/messages byaz05
Apr 25 10:21:06 byaz05 kernel: bonding: bond1: link status definitely down
for interface eth1, disabling it
Apr 25 10:21:06 byaz05 kernel: bonding: bond1: making interface eth5 the
new active one.

10:21:09 bond1 disabled (eth5)
/var/log/messages byaz05
Apr 25 10:21:09 byaz05 kernel: bonding: bond1: link status definitely down
for interface eth5, disabling it
Apr 25 10:21:09 byaz05 kernel: bonding: bond1: now running without any
active interface !

10:21:23 o2net - no longer connected
/var/log/messages byaz05
Apr 25 10:21:23 byaz05 kernel: o2net: no longer connected to node
byaz10.bayer-ag.com (num 1) at 10.190.59.6:7777
/var/log/messages byaz10
Apr 25 10:21:23 byaz10 kernel: o2net: no longer connected to node
byaz05.bayer-ag.com (num 0) at 10.190.59.5:7777

10:21:27 CSSD failure 134
10:21:29 Reboot initiated by CRS
/var/log/messages byaz05
Apr 25 10:21:27 byaz05 logger: Oracle clsomon failed with fatal status 12.
Apr 25 10:21:27 byaz05 logger: Oracle CSSD failure 134.
Apr 25 10:21:27 byaz05 su(pam_unix)[25839]: session closed for user oracle
Apr 25 10:21:27 byaz05 logger: Oracle CRS failure. Rebooting for cluster
integrity.
Apr 25 10:21:27 byaz05 kernel: md: stopping all md devices.
Apr 25 10:21:27 byaz05 kernel: md: md0 switched to read-only mode.
Apr 25 10:21:29 byaz05 logger: Oracle CRS failure. Rebooting for cluster
integrity.
Apr 25 10:21:29 byaz05 kernel: e1000: eth2: e1000_watchdog_task: NIC Link
is Up 1000 Mbps Full Duplex
Apr 25 10:21:29 byaz05 logger: Oracle init script ceding reboot to sibling
27383.

10:21:58 Reboot initiated by OCFS2(?)
/var/log/messages byaz10
Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session opened for user oracle
by (uid=0)
Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session closed for user oracle
Apr 25 10:25:58 byaz10 syslogd 1.4.1: restart.
Apr 25 10:25:58 byaz10 syslog: syslogd startup succeeded
Apr 25 10:25:58 byaz10 kernel: klogd 1.4.1, log source = /proc/kmsg
started.
Apr 25 10:25:58 byaz10 kernel: Bootdata ok (command line is ro
root=/dev/vgroot/_)
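
Since the question is about timing, it can help to reduce the timeline to offsets from the moment the bond lost its last interface. The following is only a minimal sketch: the event list is copied by hand from the log excerpts above, and the helper name `seconds_since` is made up for this illustration.

```python
from datetime import datetime

# Events copied by hand from the /var/log/messages excerpts in this post.
EVENTS = [
    ("10:21:06", "byaz05", "bond1: eth1 link down"),
    ("10:21:09", "byaz05", "bond1: eth5 link down, no active interface"),
    ("10:21:23", "both", "o2net: no longer connected"),
    ("10:21:27", "byaz05", "CSSD failure 134, CRS reboot"),
    ("10:21:58", "byaz10", "last log entry before restart"),
]

def seconds_since(start, stamp, fmt="%H:%M:%S"):
    """Return the whole seconds between two same-day HH:MM:SS stamps."""
    delta = datetime.strptime(stamp, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# The interconnect is fully down once the second slave of bond1 is gone.
OUTAGE_START = "10:21:09"

for stamp, host, what in EVENTS:
    print(f"{seconds_since(OUTAGE_START, stamp):+4d}s  {host:<7} {what}")
```

Offsets like these can then be compared with the configured OCFS2 and CSS timeouts. Those values are not shown in this post, so the sketch does not attempt that comparison.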

We have assumed all along that this is a timing problem, but we don't know
which settings cause it or which steps would correct them. Otherwise we
will have to rework the complete concept for the business-critical
systems.
Can anyone help me?

Regards,
Alexandra


Sunil Mushran

2008-May-27 16:24 UTC

[Ocfs2-users] CRS/CSS and OCFS2

In such a situation, ocfs2 fences the higher node number. afaik,
css does the same. What are the css node numbers for the two nodes?


Luis Freitas

2008-May-27 20:00 UTC

[Ocfs2-users] CRS/CSS and OCFS2

Alexandra,

   You could use only CRS and ext3 instead of ocfs2 for this kind of use.
You would need to register a script to force umount the filesystem on the
primary node and mount it on the node you are failing over to. (It would be
nice to be able to check whether the filesystem is mounted before
attempting to mount it, but I am not sure how to do this.)
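
On the question of checking whether a filesystem is mounted: on Linux the kernel's mount table is readable as `/proc/mounts`, so a failover script could consult it before mounting. A minimal sketch follows. The function names and the example path `/u02/oradata` are assumptions made for illustration, and the check only covers the node it runs on, so it says nothing about whether the other node still has the filesystem mounted.

```python
def parse_mounts(text):
    """Return (device, mount_point, fs_type) tuples from /proc/mounts-style text."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[0], fields[1], fields[2]))
    return entries

def is_mounted(target, mounts_text):
    """True if target matches a device or a mount point in the mount table."""
    return any(target in (dev, mnt) for dev, mnt, _ in parse_mounts(mounts_text))
```

A script would call it as `is_mounted("/u02/oradata", open("/proc/mounts").read())`. Note that `/proc/mounts` escapes spaces in paths as `\040`, which this sketch does not decode.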

   Are you using a cross-over cable for the private interconnect?

Regards,
Luis


Luis Freitas

2008-May-27 23:39 UTC

[Ocfs2-users] CRS/CSS and OCFS2

Sunil,

   You mean that even after the command completes, writes will still be
performed? I did some googling on this and it seems that you are right.
There is no way to force umount a filesystem on Linux, ugh. So one would
need some way to check whether the filesystem is still mounted before
attempting to mount it, or some kind of SAN or device fencing.

Regards,
Luis

--- On Tue, 5/27/08, Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
From: Sunil Mushran <Sunil.Mushran at oracle.com>
Subject: Re: [Ocfs2-users] CRS/CSS and OCFS2
To: lfreitas34 at yahoo.com
Cc: ocfs2-users at oss.oracle.com
Date: Tuesday, May 27, 2008, 7:46 PM

Lazy umount is not the same as forced umount. The processes
that have active descriptors will keep reading and writing to
the fs. The processes are not killed.
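
To see which processes would keep writing after a lazy umount, one can look for open descriptors that point below the mount point. The sketch below walks the `/proc/<pid>/fd` symlinks on Linux. The function name and the `proc_root` parameter are made up for this example, and it only sees open file descriptors (not working directories or memory maps), so treat it as a rough approximation of what `fuser -m` reports.

```python
import os

def pids_with_open_files(mount_point, proc_root="/proc"):
    """Return sorted PIDs holding an open descriptor below mount_point."""
    # The trailing separator keeps /data from matching /data2.
    prefix = os.path.join(os.path.realpath(mount_point), "")
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if target.startswith(prefix):
                holders.add(int(entry))
                break
    return sorted(holders)
```

Run as root, this lists the processes that would have to be stopped before the filesystem can safely be mounted elsewhere. Without root, descriptors of other users' processes are skipped silently.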

Luis Freitas wrote:
> Hmm,
>
>   There is a "lazy" umount:
>
>        -l     Lazy unmount. Detach the filesystem from the filesystem
>               hierarchy now, and cleanup all references to the filesystem
>               as soon as it is not busy anymore. This option allows a
>               "busy" filesystem to be unmounted. (Requires kernel 2.4.11
>               or later.)
>
>   Not sure if this prevents writes on the filesystem after the umount
> completes, or if there is some way to fence the device, so these
> problems would need to be verified.
>
>   This is the way any other cold failover HA solution works, they don't
> use cluster filesystems. If there is no way to force the filesystem to
> be umounted or fence the device, it should be possible to configure
> CRS to evict the primary node before the filesystem is mounted on the
> secondary node. (Or use some external fence device, like a SAN switch
> or power appliance).
>
>   Btw, is OCFS2 officially supported with Oracle 9i single instance
> databases?
>
> Regards,
> Luis
>


      

alexandra.strauss at bayerbbs.com

2008-Jun-05 15:38 UTC

[Ocfs2-users] CRS/CSS and OCFS2

Hi Sunil,

Sorry for the delay, but I was ill for the last 10 days.

a. We do not use a crossover cable between the two nodes. The two systems
are located in two SANs in different buildings, with redundant switches and
HBAs in between.

b. ocfs2 node numbers:
[oracle@byaz05 etc]$ cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 10.190.59.5
        number = 0
        name = byaz05.bayer-ag.com
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.190.59.6
        number = 1
        name = byaz10.bayer-ag.com
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2
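
The OCFS2 node numbers can also be pulled out of this file by a script, for example to compare them with the CSS numbers automatically. This is a minimal sketch that assumes only the stanza layout shown above (a `node:` or `cluster:` header followed by `key = value` lines); the function names are made up for the example.

```python
def parse_cluster_conf(text):
    """Parse 'name:' headers followed by 'key = value' lines into dicts."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A header such as "node:" starts a new stanza.
            current = {"type": line[:-1]}
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key] = value
    return stanzas

def node_numbers(text):
    """Map each node name to its OCFS2 node number."""
    return {
        s["name"]: int(s["number"])
        for s in parse_cluster_conf(text)
        if s["type"] == "node"
    }
```

Calling `node_numbers(open("/etc/ocfs2/cluster.conf").read())` on the file above would give `byaz05.bayer-ag.com` the number 0 and `byaz10.bayer-ag.com` the number 1.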

Cluster configuration css/crs:
/u01/app/oracle/product/crs/log/byaz05/crsd
2008-04-25 09:29:01.855: [  OCRMAS][1210108256]th_master:12: I AM THE NEW 
OCR MASTER at incar 1. Node Number 2
2008-04-25 09:29:01.862: [  OCRRAW][1210108256]proprioo: for disk 0 
(/dev/raw/raw101), id match (1), my id set (1723799148,1710759834) total 
id sets (1), 1st set (1723799148,1710759834), 2nd set (0,0) my votes (1), 
total votes (2)
2008-04-25 09:29:01.862: [  OCRRAW][1210108256]proprioo: for disk 1 
(/dev/raw/raw201), id match (1), my id set (1723799148,1710759834) total 
id sets (1), 1st set (1723799148,1710759834), 2nd set (0,0) my votes (1), 
total votes (2)

/u01/app/oracle/product/crs/log/byaz10/crsd
2008-04-25 10:21:28.781: [  OCRMAS][1210108256]th_master:13: I AM THE NEW 
OCR MASTER at incar 4. Node Number 1
2008-04-25 10:21:28.781: [  OCRMSG][1505941856]prom_rpc:1: NULL con. 
Probably got disconnected due to a remote server failure.
2008-04-25 10:21:29.324: [  OCRRAW][1210108256]proprioo: for disk 0 
(/dev/raw/raw101), id match (1), my id set (1723799148,1710759834) total 
id sets (1), 1st set (1723799148,1710759834), 2nd set (0,0) my votes (1), 
total votes (2)
2008-04-25 10:21:29.324: [  OCRRAW][1210108256]proprioo: for disk 1 
(/dev/raw/raw201), id match (1), my id set (1723799148,1710759834) total 
id sets (1), 1st set (1723799148,1710759834), 2nd set (0,0) my votes (1), 
total votes (2)
2008-04-25 10:21:29.351: [  OCRMAS][1210108256]th_master: Deleted ver keys 
from cache (master)

So the two nodes have the following node numbers:



If each stack fences the node with the higher node number, ocfs2 would have
fenced byaz10 and crs/css would have fenced byaz05. This is exactly the
behaviour we observed. But how do we solve this? Oracle says ocfs2 is
certified for use with RAC, and that software combination is nearly the
same as ours. How can ocfs2 and css be prevented from fencing different
nodes?
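
The conflict can be written down as a toy model. The sketch below assumes nothing beyond the rule quoted in this thread (after an interconnect failure each stack fences the node with the higher node number) and the numbers shown in this message; the real OCFS2 and CSS decisions also depend on heartbeats, quorum and timeouts, which are not modelled here.

```python
def fenced_node(node_numbers):
    """Return the node a stack would fence under the higher-number rule."""
    return max(node_numbers, key=node_numbers.get)

# Numbers taken from cluster.conf and from the crsd log lines in this message.
ocfs2_numbers = {"byaz05": 0, "byaz10": 1}
css_numbers = {"byaz05": 2, "byaz10": 1}

# Each stack decides on its own, so the victims can differ.
victims = {fenced_node(ocfs2_numbers), fenced_node(css_numbers)}
print("ocfs2 fences:", fenced_node(ocfs2_numbers))
print("css fences:  ", fenced_node(css_numbers))
print("nodes lost:  ", sorted(victims))
```

Under this model both nodes are lost whenever the two stacks order the nodes differently. If both numberings ranked the nodes the same way, the model would pick a single victim.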


Greets,
Alex


alexandra.strauss at bayerbbs.com

2008-Jun-05 15:46 UTC

[Ocfs2-users] CRS/CSS and OCFS2

Hi Sunil,

My Lotus Notes choked on the table from Excel... So the two nodes have the
following node numbers:
Node    ocfs2   crs/css
byaz05  0       2
byaz10  1       1

Greets,
Alex

