thr3ads.net - Ocfs2 users - [Ocfs2-users] RAC and OCFS2 timeout interaction issues [Nov 2010]


Kolstee, Ronald A (Tony)

2010-Nov-22 21:56 UTC

[Ocfs2-users] RAC and OCFS2 timeout interaction issues

We have a RAC cluster as follows:

 *   3 nodes with RHEL5, Oracle 11.1.0.7, and OCFS2 1.4.2
 *   Voting and OCR are on OCFS2, all other shared storage is on ASM
 *   Storage hardware is provided by a fibre-channel SAN fabric
 *   Interconnect uses two bonded NICs per server, connected to different
     blades on a single switch

Previously we had an issue where all three nodes would reboot if one node had
problems. This could be caused by one node crashing completely (OS crash), or
losing interconnect. For testing purposes, we've been simulating OS crashes
by suddenly resetting the server without a graceful shutdown, and simulating
loss of interconnect by using ifconfig down on both interfaces in the bond.

From what we've seen, it appears that the issue is an interaction between
the O2CB and CRS timeout values - namely that CRS is self-fencing on the
surviving nodes before O2CB has a chance to time out and recover the dead node.

By adjusting O2CB's timeout (O2CB_HEARTBEAT_THRESHOLD x 2 seconds) lower
than the CRS disktimeout value, we were able to configure the cluster so that
the "OS Crash" scenario is properly handled. The other two nodes will
survive when we completely reset the third.
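
The comparison described above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the CRS disktimeout of 200 seconds is an example value rather than a figure from this thread, and the formula follows the OCFS2 FAQ's (O2CB_HEARTBEAT_THRESHOLD - 1) x 2 seconds, which comes out slightly lower than the "x 2" approximation quoted above.

```python
def o2cb_heartbeat_timeout_s(threshold: int, interval_s: int = 2) -> int:
    """Seconds without a disk heartbeat before O2CB treats a node as dead.

    Assumes the OCFS2 FAQ formula: (threshold - 1) * 2-second interval.
    """
    return (threshold - 1) * interval_s


def o2cb_times_out_first(threshold: int, crs_disktimeout_s: int) -> bool:
    """True if O2CB gives up on a dead node before CRS's disktimeout expires."""
    return o2cb_heartbeat_timeout_s(threshold) < crs_disktimeout_s


if __name__ == "__main__":
    crs_disktimeout_s = 200  # example value only; check your own CRS setting
    for threshold in (31, 61, 121):
        timeout = o2cb_heartbeat_timeout_s(threshold)
        ok = o2cb_times_out_first(threshold, crs_disktimeout_s)
        print(f"threshold={threshold} -> {timeout}s, below disktimeout: {ok}")
```

The check only covers the disk-heartbeat path (the "OS crash" case); it says nothing about the network idle timeout that applies when the interconnect is lost.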

However, we haven't solved the loss of interconnect scenario and believe the
problem to be similar. I'd rather not get bogged down in the specifics,
logfiles, etc. at this point in time, as we have to apply these concepts to
other environments as well, and there is still some tweaking of these
configurations to be done.

Can anyone please provide a generic conceptual overview on how the CRS and O2CB
timeout values interact in this scenario?  Does O2CB_HEARTBEAT_THRESHOLD come
into play at all when dealing with loss of interconnect, and how does it
interact with the value of O2CB_IDLE_TIMEOUT_MS? How does this correlate with
the timeouts on the CRS side of the equation?

I appreciate any help that anyone can offer.

Thanks in advance,
Tony Kolstee
Sr. Systems Engineer
Aetna





Karim Alkhayer

2010-Nov-23 00:11 UTC


[Ocfs2-users] RAC and OCFS2 timeout interaction issues

Here are some hints:

-       Starting with version 1.2.5, the OCFS2 network timeout is
configurable.

-       If using network bonding, you should set the network idle timeout to
at least 30 seconds.

-       Set O2CB_IDLE_TIMEOUT_MS to at least 30000. If the problem persists,
set it to 60000.
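
As a rough way to check the last hint, the following Python sketch parses settings in the KEY=value style used by the o2cb configuration and tests whether O2CB_IDLE_TIMEOUT_MS meets the suggested 30000 ms minimum. The helper names and the sample text are hypothetical, and the sample values are illustrative, not taken from the cluster discussed in this thread.

```python
def parse_o2cb_settings(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def idle_timeout_ok(settings: dict, minimum_ms: int = 30000) -> bool:
    """True if O2CB_IDLE_TIMEOUT_MS is set and at least minimum_ms."""
    raw = settings.get("O2CB_IDLE_TIMEOUT_MS", "")
    return raw.isdigit() and int(raw) >= minimum_ms


# Hypothetical sample; real values belong to your own cluster configuration.
SAMPLE = """
# illustrative settings only
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
"""

if __name__ == "__main__":
    print(idle_timeout_ok(parse_o2cb_settings(SAMPLE)))
```

Passing minimum_ms=60000 checks against the higher value suggested if the problem persists.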

 


onedbguru at gmail.com

2010-Nov-23 03:44 UTC


[Ocfs2-users] RAC and OCFS2 timeout interaction issues

If you move to 11.2.0.1, you can put the OCR, the voting disks, and shared
binaries and/or the FRA in ASM/ACFS. That is a much cleaner implementation.


