Karim Alkhayer
2009-Jan-26 17:42 UTC
[Ocfs2-users] How to force node [a] to consider node [b] dead?
Hi All,

We have O2CB_HEARTBEAT_THRESHOLD set to 601 because the SAN sometimes gets overloaded, which causes the nodes to panic. This value has proven to be more stable than 31. However, there are times when one of the nodes, for instance node [b], crashes for whatever reason. While attempting to start up the troublesome node, auto mount is enabled but doesn't succeed; "Transport endpoint is not connected" is usually displayed.

My opinion is this: the mount doesn't succeed because node [a] still thinks that node [b] is alive. We're talking about a restart that can take around 15 minutes, so basically, the threshold is passed.

I was wondering if there is a workaround to kick node [b] out of the cluster so that it can join it again. What I've done so far (the incident happened once, a month ago) is to restart the cluster services on both machines. This was a very expensive solution, as all database instances had to go down.

OCFS2 1.2.1, SLES9 SP3 2.6.5-7.257-default, RAC 10.1.0.5, 5 DBs

Thanks
Karim
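As a rough check on the numbers in the message above, here is a minimal sketch that converts a threshold value into a disk heartbeat timeout. It assumes the relationship given in the OCFS2 FAQ, threshold = (timeout in seconds / 2) + 1, that is, a 2-second heartbeat interval. The function name and constant are hypothetical and used only for illustration; verify the formula against the documentation for the installed version.

```python
# Assumption: o2cb writes a disk heartbeat every 2 seconds, and a node is
# declared dead after (threshold - 1) missed intervals (per the OCFS2 FAQ).
HEARTBEAT_INTERVAL_SECONDS = 2


def heartbeat_timeout_seconds(threshold: int) -> int:
    """Return the disk heartbeat timeout implied by O2CB_HEARTBEAT_THRESHOLD."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


# The two values mentioned in the message: the old setting and the new one.
for threshold in (31, 601):
    timeout = heartbeat_timeout_seconds(threshold)
    print(f"O2CB_HEARTBEAT_THRESHOLD={threshold}: {timeout} s ({timeout / 60:g} min)")
```

Under that assumption, 31 corresponds to 60 seconds and 601 to 1200 seconds (20 minutes). If so, a restart of around 15 minutes would finish before the 20-minute window expires, which would be consistent with node [a] still treating node [b] as alive when the mount is attempted.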
Sunil Mushran
2009-Jan-26 17:52 UTC
[Ocfs2-users] How to force node [a] to consider node [b] dead?
You are running a three-year-old version of the fs. Please upgrade to something more current, such as SLES9 SP4 or SLES10 SP1 (bundling ocfs2 1.2.9), or SLES10 SP2 (shipping ocfs2 1.4.1).