Uwe Ritzschke
2011-Mar-11 13:13 UTC
[Samba] Samba in Pacemaker-Cluster: CTDB fails to get recovery lock
I'm currently testing fail-over with a two-node active-active cluster (nodes dig and dag): both nodes are up, then one is manually killed. CTDB on the node that's still alive should perform a recovery, after which everything should be working again.

What happens infrequently is this:

After killing the pacemaker process on dag (and dag consequently being fenced), dig's CTDB tries to get the recovery lock and fails. As there is no other node online that could take the recovery lock and thus finish CTDB's recovery, dig's CTDB keeps trying to get the recovery lock until it is manually stopped.
The only way to get CTDB back to work is to restart OCFS2's distributed lock manager.

Log files and the pacemaker configuration are attached; any help would be greatly appreciated :)

Regards,
Uwe

Our setup:

Two nodes directly connected via LAN, running openSUSE 11.3 and sharing a SAN drive that is connected via two interfaces using multipath.

pacemaker 1.1.2
corosync 1.2.1
cluster-glue 1.0.5-1.4
ctdb 1.0.114-2.20
ocfs2 1.4.3-1.4
multipath 0.4.8-51.3

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: crm.config
URL: <http://lists.samba.org/pipermail/samba/attachments/20110311/024ac44a/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: log.ctdb
URL: <http://lists.samba.org/pipermail/samba/attachments/20110311/024ac44a/attachment-0001.ksh>
Jim McDonough
2011-Mar-14 18:23 UTC
[Samba] Samba in Pacemaker-Cluster: CTDB fails to get recovery lock
On Fri, Mar 11, 2011 at 8:13 AM, Uwe Ritzschke <uwe.ritzschke.2 at cms.hu-berlin.de> wrote:
> I'm currently testing fail-over with a two-node active-active cluster (with
> node dig and node dag): Both nodes are up, one is manually killed. CTDB on
> the node that's still alive should perform a recovery and everything should
> be working again.
>
> What's infrequently happening is:
>
> After killing the pacemaker-process on dag (and dag consequently being
> fenced), dig's CTDB tries to get the recovery lock and fails. As there is no
> other node online to get the recovery lock and thus finishing CTDB's
> recovery, dig's CTDB keeps trying to get the recovery lock until manually
> stopped.
> The only way to get CTDB back to work is to restart OCFS2's distributed lock
> manager.
>
>
> Our setting:
>
> two nodes directly connected via LAN running openSuse 11.3 and sharing a
> SAN-drive that is connected via two interfaces using multipath.
>
> pacemaker 1.1.2
> corosync 1.2.1
> cluster-glue 1.0.5-1.4
> ctdb 1.0.114-2.20
> ocfs2 1.4.3-1.4
> multipath 0.4.8-51.3
>

You might want to try the updated packages from this repository:

http://download.opensuse.org/repositories/network:/ha-clustering/openSUSE_11.3/

This would give you newer code levels for the HA packages.

--
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org
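
For anyone trying to reproduce the symptom outside of CTDB, the following is a minimal sketch that attempts a non-blocking exclusive byte-range lock on a file, which can help tell a filesystem locking problem apart from a CTDB one. It assumes that the recovery lock is taken as an fcntl-style lock on a file in the shared OCFS2 filesystem; that is an assumption about this CTDB version and not something shown in the thread. The function name `try_lock` is made up for illustration, and this is not the code CTDB itself runs.

```python
import fcntl
import os


def try_lock(path):
    """Try a non-blocking exclusive fcntl lock on the first byte of `path`.

    Returns (fd, None) on success; keep fd open for as long as the lock
    should be held. Returns (None, exc) if the lock could not be taken.
    """
    # Create the file if it does not exist yet, as a lock file may be absent.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_EX | LOCK_NB maps to a non-blocking write lock (F_SETLK).
        # Arguments: length 1, starting at offset 0 from the file start.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
    except OSError as exc:
        os.close(fd)
        return None, exc
    return fd, None
```

Calling `try_lock()` on a scratch file in the same OCFS2 mount as the recovery lock file (not on the live lock file itself) from the surviving node after the other node has been fenced would show whether the filesystem still grants locks. If it returns an error there while the same call works on a local filesystem, that points at the distributed lock manager and not at CTDB.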