Galan Merchan, Martin
2006-Oct-04 00:37 UTC
[Ocfs2-users] Re: FW: Use of OCFS2 file systems.
Hello,

I'm working with OCFS2 on Red Hat Advanced Server 4 Patch 3, and I had kernel panics too. I use OCFS2 only for RAC archive logs and RMAN backups.

Well, I'm testing one solution, and it seems to work: in /etc/ocfs2/cluster.conf I have replaced the public IPs with the heartbeat IPs (parameter ip_address), but kept the names.

Does anyone know this solution, and has anyone tested it under failure conditions?

Regards from Spain,
MARTÍN

-----Original Message-----
From: ocfs2-users-bounces@oss.oracle.com [mailto:ocfs2-users-bounces@oss.oracle.com] On Behalf Of Alexei_Roudnev
Sent: Wednesday, October 04, 2006 0:49
To: Sunil Mushran; ocfs2-users
Subject: Re: [Ocfs2-users] Re: FW: Use of OCFS2 file systems.

Unfortunately, it MAKES THE CLUSTER LESS STABLE. It works as long as the network and SAN systems are fine, but it does not behave well in failure situations. Even when we use OCFSv2 for idle file systems (which do nothing 90% of the time), o2cb reboots nodes when it loses the heartbeat or (worse) the network or (even worse) both, instead of trying to recover without a reboot (as I said, the FS is in a consistent state, with no activity at all).

It is not just an OCFSv2 problem: Oracle CSS behaves similarly (but is much more stable in reality), and so does the Linux HA cluster (but it can use several different heartbeat connections, so it can be configured to be very reliable). You are right in saying that _cluster software always has a tendency to fence or kill neighbours to keep internal consistency_. But OCFSv2 is one of the worst examples of such software.

What can be done _relatively easily_:

(1) As we have said many times: redundancy and better timeout control in the heartbeat. (Of course, long timeouts mean _long recovery_, but that is OK for 90% of installations.) Typical network recovery takes 1 minute, not 10 seconds.

(2) The system should not do harmful things IF it is in a consistent state.
In many cases, if the system has no outstanding I/O requests, it can recover without a server reboot (or at least try to) even if it has lost heartbeats and suspects that other systems could have taken control away from it. _How to do it safely_ is a serious theoretical challenge, but it is very desirable for such systems.

(3) In some configurations, the FS can be treated as _not so important_. That means it is safer to switch to read-only and try to recover online rather than panic. A good example: you run a production Oracle database that uses ASM, and you use OCFSv2 for backup storage. It is safer to return I/O failures on this storage than to reboot the system for no reason.

PS. I had 2 network outages in the lab today because of a bad UPS, and in every case ALL OCFSv2 servers (in 2 different clusters) rebooted. Not one survived a short (30-second) loss of the Ethernet connection (including iSCSI). In some cases one server was rebooted by OCFS and another by a different part of the cluster (HA or RAC), but the result is exactly the same: _all_ OCFSv2 nodes panic on a short network/SAN outage, in all cases.

----- Original Message -----
From: "Sunil Mushran" <Sunil.Mushran@oracle.com>
To: "ocfs2-users" <ocfs2-users@oss.oracle.com>
Sent: Tuesday, October 03, 2006 1:51 PM
Subject: [Ocfs2-users] Re: FW: Use of OCFS2 file systems.

> I try to avoid responding to such emails because I am not sure how much credibility a partisan has in such debates. After all, I have been working on OCFS/OCFS2 for the last 4 to 5 years.
>
> Having said that, I have some issues with the statements. While it is true that we can improve on the disk/net heartbeat, it is wrong to say that it does not work or makes the cluster unstable.
>
> We have OCFS2 running on lots of clusters in Oracle that are testing each new revision of the database. While these machines are test boxes, they are all running loads designed to break Oracle.
> I am rarely pinged about them hitting an OCFS2 issue.
>
> We also have internal production databases, as well as Oracle customers, using OCFS2 with much success.
>
> However, we do have room for improvement and we are working on it.
>
> For the list of ongoing projects, you can peruse the OCFS2 Development Wiki at http://oss.oracle.com/osswiki/OCFS2.
>
> If you wish to contribute code, as this is an open source project, feel free to ping me or the ocfs2-devel@oss.oracle.com mailing list.
>
> Thanks
> Sunil Mushran
>
> > Hi Sunil,
> >
> > What are your thoughts about this message on the mailing lists?
> >
> > Thanks!
> > Sanjeet
> >
> > ------------------------------------------------------------------------
> >
> > *From:* ocfs2-users-bounces@oss.oracle.com [mailto:ocfs2-users-bounces@oss.oracle.com] *On Behalf Of* Alexei_Roudnev
> > *Sent:* Friday, September 29, 2006 11:50 PM
> > *To:* Bill Wells; Sunil Mushran
> > *Cc:* ocfs2-users@oss.oracle.com
> > *Subject:* Re: [Ocfs2-users] Use of OCFS2 file systems.
> >
> > If you can avoid OCFSv2 on a RAC server, it is better to do so. Any cluster (RAC and OCFS) has its own instability elements (OCFSv2 has a poor heartbeat algorithm and so tends to self-fence without a real failure) and, in addition, it is relatively new. It works well enough to be used when you really need file sharing (such as database files, backups, or even archive logs), but the less you use it, the better.
> > Oracle home files do fine without sharing.
> >
> > // I don't see problems with OCFSv2 on SLES9 SP3 (updated), but I avoid using it for mission-critical or heavy-duty file systems,
> >
> > // and I still have a failure scenario where the RAC cluster could keep working but OCFS causes a full-cluster failure.
> >
> > // If you have a network problem, a SAN system restart, a disk I/O error, etc., you can end up with a system panic or reboot caused by OCFS,
> >
> > // so the less OCFS you have, the better your system stability.
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users

This e-mail may contain confidential or privileged information. Any unauthorised copying, use or distribution of this information is strictly prohibited.
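Point (1) in Alexei's message, about heartbeat timeouts, can be made concrete for the disk heartbeat. As we understand the OCFS2 1.2-era documentation, that timeout is controlled by O2CB_HEARTBEAT_THRESHOLD (typically set through `/etc/init.d/o2cb configure` and stored in /etc/sysconfig/o2cb), with a timeout of roughly (threshold - 1) * 2 seconds. The Python sketch below only does that arithmetic under the stated 2-second-interval assumption; the helper names are hypothetical, it says nothing about the separate network timeout, and the FAQ for your OCFS2 version should be checked before relying on it.

```python
import math

# Assumption (OCFS2 1.2-era o2cb, as we understand its documentation):
# disk heartbeat timeout in seconds = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2.
HEARTBEAT_INTERVAL_S = 2


def heartbeat_timeout_s(threshold: int) -> int:
    """Disk heartbeat timeout, in seconds, for a given threshold (hypothetical helper)."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def threshold_for_timeout(timeout_s: float) -> int:
    """Smallest threshold whose timeout is at least timeout_s seconds (hypothetical helper)."""
    if timeout_s <= 0:
        raise ValueError("timeout must be positive")
    return math.ceil(timeout_s / HEARTBEAT_INTERVAL_S) + 1


if __name__ == "__main__":
    # The "typical network recovery is 1 minute" figure from point (1).
    t = threshold_for_timeout(60)
    print(f"O2CB_HEARTBEAT_THRESHOLD={t} -> {heartbeat_timeout_s(t)} s")
    # prints "O2CB_HEARTBEAT_THRESHOLD=31 -> 60 s"
```

Under this assumption, a threshold of 16 gives exactly 30 seconds, so riding out a 30-second outage like the one in the PS would need a noticeably larger value, at the cost of the slower recovery Alexei mentions.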
File a bug on bugzilla (oss.oracle.com/bugzilla) with the full oops trace and any other information that seems relevant.

Galan Merchan, Martin wrote:
> Hello,
>
> I'm working with OCFS2 on Red Hat Advanced Server 4 Patch 3, and I had kernel panics too. I use OCFS2 only for RAC archive logs and RMAN backups.
>
> Well, I'm testing one solution, and it seems to work:
>
> In /etc/ocfs2/cluster.conf I have replaced the public IPs with the heartbeat IPs (parameter ip_address), but kept the names.
>
> Does anyone know this solution, and has anyone tested it under failure conditions?
>
> Regards from Spain,
>
> *_MARTÍN_*
>
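Martin's workaround keeps each node's `name` (which, as we understand it, must match that node's hostname) but points `ip_address` at the private interconnect instead of the public network. As a sketch of the resulting file, the Python below renders a cluster.conf body in the stanza layout OCFS2 uses (`node:` and `cluster:` headers in column 0, tab-indented `key = value` lines). The hostnames `rac1`/`rac2`, the 10.0.0.x addresses, and the helper names are hypothetical; port 7777 is assumed to be the usual o2net default, so compare against your own file. The same cluster.conf is normally copied unchanged to every node.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    number: int
    name: str            # node hostname; the workaround leaves these unchanged
    ip_address: str      # interconnect (heartbeat) IP instead of the public IP
    ip_port: int = 7777  # assumed o2net default port


def render_cluster_conf(cluster: str, nodes: list[Node]) -> str:
    """Hypothetical helper: return cluster.conf text with one `node:` stanza
    per node, followed by a single `cluster:` stanza."""
    lines: list[str] = []
    for n in nodes:
        lines += [
            "node:",
            f"\tip_port = {n.ip_port}",
            f"\tip_address = {n.ip_address}",
            f"\tnumber = {n.number}",
            f"\tname = {n.name}",
            f"\tcluster = {cluster}",
            "",  # blank line between stanzas
        ]
    lines += [
        "cluster:",
        f"\tnode_count = {len(nodes)}",
        f"\tname = {cluster}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical two-node RAC: the public IPs do not appear anywhere.
    nodes = [
        Node(number=0, name="rac1", ip_address="10.0.0.1"),
        Node(number=1, name="rac2", ip_address="10.0.0.2"),
    ]
    print(render_cluster_conf("ocfs2", nodes), end="")
```

The sketch only shows the file layout; whether binding the cluster traffic to the interconnect actually avoids the panics is the open question Martin asks about.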