Unfortunately, it MAKES THE CLUSTER LESS STABLE. It works as long as the
network and SAN systems are fine, but it behaves badly in failure situations.
Even if we use OCFSv2 for idle file systems (which do nothing 90% of the
time), o2cb reboots nodes when they lose the heartbeat, or (worse) the
network, or (worst of all) both, instead of trying to recover without a
reboot (as I said, the FS is in a consistent state, with no activity at all).
It is not just an OCFSv2 problem - Oracle CSS behaves similarly (but is much
more stable in reality), and so does Linux HA cluster (but it can use
several different heartbeat connections, so it can be configured to be very
reliable).
You are right in saying that _cluster software always has a tendency to fence
or kill neighbours to keep internal consistency_. But OCFSv2 is one of the
worst examples of such software.
What can be done _relatively easily_:
(1) As we have said many times - redundancy and better timeout control in
the heartbeat. (Of course, long timeouts mean _long recovery_, but that is
OK for 90% of installations.) Typical network recovery takes 1 minute, not
10 seconds.
(2) The system should not do bad things IF it is in a consistent state. In
many cases, if the system has no outstanding IO requests, it can recover
without a server reboot (or at least try to) even if it has lost heartbeats
and suspects that other systems could have taken control away from it.
_How to do it safely_ is a serious theoretical challenge, but it is very
desirable for such systems.
(3) In some configurations, the FS can be treated as _not so important_.
That means it is safer to switch it to read-only and try to recover online
than to panic. A good example - you have a production Oracle which uses ASM,
and you use OCFSv2 for backup storage. It is safer to fail IO operations on
this storage than to reboot the system without reason.
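
To make points (1)-(3) concrete, here is a minimal sketch of the decision
order I have in mind. It is NOT how o2cb works today and it does not solve
the safety question from (2); all names (NodeState, Action, decide) are
made up for illustration, and the 60-second default only reflects the
"typical network recovery is 1 minute" figure above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "keep waiting for the heartbeat to come back"
    RECOVER_ONLINE = "try to recover without a reboot"
    READ_ONLY = "switch the FS to read-only and fail IO"
    FENCE = "self-fence (reboot the node)"


@dataclass
class NodeState:
    seconds_without_heartbeat: float  # time since the last good heartbeat
    outstanding_io: int               # IO requests still in flight
    critical_fs: bool                 # False for e.g. backup storage


def decide(state: NodeState, timeout_s: float = 60.0) -> Action:
    """Hypothetical policy; the parameter names are assumptions."""
    # (1) A longer timeout rides out short network/SAN outages.
    if state.seconds_without_heartbeat < timeout_s:
        return Action.WAIT
    # (2) No outstanding IO: the FS is consistent, so try online recovery.
    if state.outstanding_io == 0:
        return Action.RECOVER_ONLINE
    # (3) A "not so important" FS: failing IO is cheaper than a reboot.
    if not state.critical_fs:
        return Action.READ_ONLY
    # Only a busy, critical FS falls back to fencing.
    return Action.FENCE
```

With this ordering, an idle node that loses its heartbeat for 30 seconds
simply waits, and a reboot is left as the last resort for a busy, critical
file system.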
PS. I had 2 network outages in the lab today, because of a bad UPS - and in
all cases, ALL OCFSv2 servers (in 2 different clusters) rebooted. Not one
survived a short (30 seconds) loss of the Ethernet connection (including
iSCSI). In some cases, one server was rebooted by OCFS and the other by
another part of the cluster (HA or RAC) - but the result is exactly the
same - _all_ OCFSv2 nodes panic on a short network/SAN outage, in all cases.
----- Original Message -----
From: "Sunil Mushran" <Sunil.Mushran@oracle.com>
To: "ocfs2-users" <ocfs2-users@oss.oracle.com>
Sent: Tuesday, October 03, 2006 1:51 PM
Subject: [Ocfs2-users] Re: FW: Use of OCFS2 file systems.
> I try to avoid responding to such emails because I am not sure how
> much credibility a partisan has in such debates. After all I have been
> working on OCFS/OCFS2 the last 4/5 years.
>
> Having said that, I have some issues with the statements. While it is true
> that we can improve on the disk/net heartbeat, it is wrong to say that it
> does not work or makes the cluster unstable.
>
> We have OCFS2 running on lots of clusters in Oracle that are testing each
> new revision of the database. While these machines are test boxes, they are
> all running loads designed to break Oracle. I am rarely pinged about them
> hitting an OCFS2 issue.
>
> We also have internal production databases as well as Oracle customers who
> are using OCFS2 with much success.
>
> However, we do have room for improvement and we are working on it.
>
> For the list of ongoing projects, you can peruse the OCFS2 Development
> Wiki at http://oss.oracle.com/osswiki/OCFS2.
>
> If you wish to contribute code, as this is an open source project, feel free
> to ping me or the ocfs2-devel@oss.oracle.com mailing list.
>
> Thanks
> Sunil Mushran
>
> >
> > Hi Sunil,
> >
> > What are your thoughts about this message on the mailing lists?
> >
> > Thanks!
> > Sanjeet
> >
> >
> >
> > ------------------------------------------------------------------------
> >
> > *From:* ocfs2-users-bounces@oss.oracle.com
> > [mailto:ocfs2-users-bounces@oss.oracle.com] *On Behalf Of* Alexei_Roudnev
> > *Sent:* Friday, September 29, 2006 11:50 PM
> > *To:* Bill Wells; Sunil Mushran
> > *Cc:* ocfs2-users@oss.oracle.com
> > *Subject:* Re: [Ocfs2-users] Use of OCFS2 file systems.
> >
> >
> >
> > If you can avoid OCFSv2 on a RAC server, better do it. Any cluster
> > (RAC and OCFS) has its own instability elements (OCFSv2 has a poor
> > heartbeat algorithm and so tends to self-fence without a real failure,
> > and, in addition, is relatively new). It works well enough to be used
> > when you really need file sharing (such as database files or backups
> > or even archive logs), but the less you use it, the better. Oracle
> > home files do fine without sharing.
> >
> >
> >
> > // I don't see problems with OCFSv2 on SLES9 SP3-updated, but I avoid
> > // using it for mission-critical or heavy-duty file systems,
> > // and I still have a failure scenario where the RAC cluster could work
> > // but OCFS causes a full-cluster failure.
> > // If you have a network problem, SAN system restart, disk IO error,
> > // etc. - you can end up with a system panic or reboot caused by OCFS -
> > // so the less OCFS you have, the better your system stability.
> >
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>