Well, I'm not 100% sure I solved the problem in a definitive way, but
here's the complete story:
1 - install, if you can, the latest release of ocfs2 + tools. The fact that a
node reboots instead of panicking (and resting in peace until manual
intervention) is a real lifesaver if you do not have immediate access to the
server farm. Plus, the timeouts are configurable.
2 - when a cluster node is rebooted by the ocfs2 daemon, a telltale message is
printed on the console of the node. Messages from the ocfs2 daemon will also be
present in /var/log/messages on the other nodes, but looking at those it is hard
to tell whether the dying node was shut down by ocfs2 or by other causes.
You can either sit in front of the screen or start the netdump service on the
rebooting node and the netdump-server service on a spare machine (another node
of the cluster is fine; for best results use a different NIC/interconnect from
the one used by ocfs2). If you are using Red Hat, the man pages for both
services are quite straightforward; a rough sketch of the setup is below.
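For reference, the setup looks more or less like this (a sketch from memory, so
double check the netdump(8) and netdump-server(8) man pages; <server-ip> is a
placeholder):

    # on the spare machine that collects the output (netdump-server rpm)
    chkconfig netdump-server on
    service netdump-server start     # console logs / crash dumps end up under /var/crash/

    # on each node that might get fenced (netdump rpm)
    # edit /etc/sysconfig/netdump and set NETDUMPADDR=<server-ip>
    service netdump propagate        # copies the key to the server, asks for the netdump user password
    chkconfig netdump on
    service netdump start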
3 - in our case, the log we netdumped said:
(6,0):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device emcpowere2
after 12000 milliseconds
Heartbeat thread (6) printing last 24 blocking operations (cur = 7):
Heartbeat thread stuck at waiting for read completion, stuffing current time
into that blocker (index 7)
Index 8: took 0 ms to do submit_bio for read
[ ... ]
Index 7: took 9998 ms to do waiting for read completion
*** ocfs2 is very sorry to be fencing this system by restarting ***
4 - thus we determined ocfs2 was indeed at fault. Operations on other files were
fine, but using rman to create a single 1.3 GB file on the ocfs2 disk was
somehow triggering a heartbeat timeout.
5 - we modified the configuration of our rman scripts to try to keep the files
it creates smaller. We tested again, and there was no reboot. I am not sure you
can achieve the same result for failovers though - the general idea is to keep
I/O in smaller chunks (or slow it down somehow?); see the sketch below.
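For what it's worth, the change was along these lines (only a sketch - the
channel name, piece size and rate are made-up examples, not our exact production
values; MAXPIECESIZE caps the size of each backup piece, RATE throttles the
channel):

    run {
      # keep each backup piece below 500 MB and read at most 10 MB/s
      allocate channel d1 device type disk maxpiecesize 500M rate 10M;
      backup archivelog all
        format '/home/SANstorage/oracle/backup/rman/dump_log/FULL_20070509_154916/arc_%d_%u'
        delete input;
      release channel d1;
    }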
6 - as Sunil recommended (sorry, I think this was off list), we also raised the
ocfs2 timeout value O2CB_HEARTBEAT_THRESHOLD. Precise instructions for that
can be found here:
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.
We decided to go with a value of 31 (a sketch of the change is below). We did
not raise the timeouts for the network keepalives (yet), since we are not using
bonded NICs for the ocfs2 interconnect. We might do that in the future if we
find out that traffic on that network is extremely high / the network is
unstable, though...
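For the record, the FAQ explains that the disk heartbeat timeout works out to
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so the default of 7 gives the
12000 ms you see in the log above, and 31 gives 60 seconds. The change itself
boils down to something like this on every node (only a sketch - see the FAQ
link for the exact sequence):

    # /etc/sysconfig/o2cb - must be identical on all nodes
    O2CB_HEARTBEAT_THRESHOLD=31

    # then, with the ocfs2 volumes unmounted on the node:
    /etc/init.d/o2cb offline
    /etc/init.d/o2cb unload
    /etc/init.d/o2cb load
    /etc/init.d/o2cb online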
Hope it helps
Gaetano
  -----Original Message-----
  From: Mattias Segerdahl [mailto:mattias.segerdahl@mandator.com]
  Sent: Friday, May 11, 2007 10:00 AM
  To: Gaetano Giunta
  Subject: RE: [Ocfs2-users] PBL with RMAN and ocfs2
  Hi,
   
  We're having the exact same problem: if we do a failover between two
filers/SANs, the server reboots.
   
  So far I haven't found a solution to the problem, would you mind trying to
explain how you solved the problem, step by step?
   
  Best Regards,
   
  Mattias Segerdahl
   
  From: ocfs2-users-bounces@oss.oracle.com
[mailto:ocfs2-users-bounces@oss.oracle.com] On Behalf Of Gaetano Giunta
  Sent: den 11 maj 2007 09:47
  To: Ocfs2-users@oss.oracle.com
  Subject: RE: [Ocfs2-users] PBL with RMAN and ocfs2
   
  Thanks, but I had already checked out all the logs I could find (Oracle and
CRS alerts, /var/log stuff) and there was no clear indication in there.
   
  The catch is that ocfs2 was sending the alert message to the console only (I
wonder why it does not also leave traces in syslog; my best guess is that it
tries to shut down as fast as it can, and sending a message to the console is
faster than sending it to syslog - but I'm in no way a Linux guru...).
   
  By using the netdump tool suggested by Sunil I managed to see the console
messages of the dying node (without having to physically be in the server farm,
which is 40 km away from my usual workplace), and diagnosed the ocfs2 heartbeat
as "the killer".
   
  Bye
  Gaetano
    -----Original Message-----
    From: Luis Freitas [mailto:lfreitas34@yahoo.com]
    Sent: Thursday, May 10, 2007 11:17 PM
    To: Gaetano Giunta
    Cc: Ocfs2-users@oss.oracle.com
    Subject: Re: [Ocfs2-users] PBL with RMAN and ocfs2
    Gaetano,
     
        If o2cb or CRS is killing the machine, it usually shows up in
/var/log/messages with lines explaining what happened. Take a look at
/var/log/messages just before the last "syslogd x.x.x: restart".
     
    Regards,
    Luis
    Gaetano Giunta wrote:
    > Hello.
    >
    > On a 2 node RAC 10.2.0.3 setup, on RH ES 4.4 x86_64, with ocfs2 1.2.5-1,
we are experiencing some trouble with RMAN: when the archive log destination is
on an ASM partition and the backup destination is on ocfs2, running
    >
    > backup archivelog all format
'/home/SANstorage/oracle/backup/rman/dump_log/FULL_20070509_154916/arc_%d_%u'
delete input;
    >
    > consistently causes a reboot.
    >
    > The rman catalog is clean, and has been crosschecked in every way.
    >
    > We tried on both nodes, and the node executing the backup always
reboots.
    > I am thus inclined to think that it is not the ocfs2 dlm that triggers
the reboot, because in that case the victim would always be the second node.
    >
    > I also tested the same command using /tmp as the backup destination, and
all was fine. The backup file of the archived logs is 1249843712 bytes in size.
    >
    > Our local Oracle guy went through Metalink and said there is no open
bug/patch for this at this time.
    >
    > Any suggestions ???
    >
    > Thanks
    > Gaetano Giunta
    >
    > 
    >
------------------------------------------------------------------------
    >
    > _______________________________________________
    > Ocfs2-users mailing list
    > Ocfs2-users@oss.oracle.com
    > http://oss.oracle.com/mailman/listinfo/ocfs2-users
    _______________________________________________
    Ocfs2-users mailing list
    Ocfs2-users@oss.oracle.com
    http://oss.oracle.com/mailman/listinfo/ocfs2-users
     
Gaetano,
   
        I am using RMAN with the default configuration here on RH 4.0, but I had
to change the I/O scheduler to the "deadline" scheduler to prevent these reboots
(a rough example of the switch is below), and increased the o2cb timeouts too.
We had some reboots just after implementing it, but it seems very stable now. We
increased the timeout here to 130, to account for SAN switch failures, PowerPath
and such.
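   In case it helps, switching to deadline is roughly the following (the device
name is just an example, and on older RHEL 4 kernels only the boot-time
parameter may be available):

       # globally: append elevator=deadline to the kernel line in /boot/grub/grub.conf and reboot
       # or per device, on kernels that expose the sysfs tunable:
       cat /sys/block/emcpowere/queue/scheduler
       echo deadline > /sys/block/emcpowere/queue/scheduler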
   
       I am still on 1.2.1 on the production nodes and it panics the machine,
which is annoying even when the servers are in the same building, but there are
always messages in /var/log/messages of the killed node showing what happened.
Funny that 1.2.5 no longer shows these.
   
  Regards,
  Luis
  
Dear sir,
We are using RHEL 4.0 with kernel 2.6.9-42.0.2.ELsmp with ocfs2, and we have
Oracle 10g 10.1.0.2.
I am working on a RAC installation (10.1.0.2) on RHEL 4.0 with EMC CLARiiON
shared storage. The EMC devices have been linked to raw devices.
We were able to configure ocfs2 and ASM. When we installed CRS, the error
message showed "Oracle Cluster Registry can exist only as a shared file system
file or as a shared raw partition".
I would like to ask how to install the OCR file.
Regards
Dheerendar Srivastav
Associate vice President -IT
Bajaj Capital Ltd.
New Delhi
Ensure you have mounted the volume storing the OCR and the voting disk file with
the "nointr,datavolume" mount options. See
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#RAC
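A minimal example of what that looks like (device name and mount point are just
placeholders, not taken from your setup):

    # mount by hand
    mount -t ocfs2 -o nointr,datavolume /dev/emcpowera1 /u02/ocr

    # or the equivalent /etc/fstab entry
    /dev/emcpowera1   /u02/ocr   ocfs2   _netdev,nointr,datavolume   0 0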