Christian Kuntz
2020-May-06 20:16 UTC
[Samba] Nodes in CTDB Cluster don't release recovery lock
Hello all,

First of all, apologies if this isn't the right location for this question; I couldn't find a CTDB-specific mailing list or IRC channel, so I figured the general one would be appropriate. Please let me know if this question is better placed elsewhere.

I'm trying to test clustered Samba and have a two-node CTDB setup (following the guide here: https://wiki.samba.org/index.php/CTDB_and_Clustered_Samba), but I can't seem to get the current setup into a totally healthy state. If no recovery lock is configured, each node thinks it is healthy and the other is dead; if a recovery lock is configured, one node seizes it and never releases it.

I'm running Debian Buster, with Lustre+ZFS as the clustered file system for the recovery lock, and using upstream's Samba and CTDB. The Lustre clients are mounted with flock, and I've confirmed the lock is being held with `lslocks`.

Here is the debug information from the "healthy" node (the unhealthy one just says it can't take the lock because it is under contention, so I thought it would be of little use):
https://drive.google.com/drive/folders/1yzFdwGKfDAHiVbsj7pcOe-jX4KHpFO2U?usp=sharing

Best regards, and thank you all for your time,
Christian
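For reference, with the current CTDB configuration format the recovery lock is set in /etc/ctdb/ctdb.conf and must point at a file on the cluster file system that every node mounts. A minimal sketch, assuming a hypothetical Lustre mount point of /mnt/lustre (the actual path in this setup isn't shown):

  # /etc/ctdb/ctdb.conf
  [cluster]
      recovery lock = /mnt/lustre/.ctdb/reclock

Whether the lock is really held can then be cross-checked on the node that took it, for example:

  $ sudo lslocks | grep reclock    # the lock on the recovery lock file
  $ sudo ctdb status               # node health and recovery mode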
Martin Schwenke
2020-May-07 07:05 UTC
[Samba] Nodes in CTDB Cluster don't release recovery lock
Hi Christian,

On Wed, 6 May 2020 13:16:36 -0700, Christian Kuntz via samba
<samba at lists.samba.org> wrote:

> First of all, apologies if this isn't the right location for this question,
> I couldn't find a CTDB specific mailing list or IRC so I figured the
> general one would be appropriate. Please let me know if this question is
> better placed elsewhere.

This is the right place. Thanks for asking! :-)

> I'm trying to test clustered samba and have a two node CTDB setup
> (Following the guide here:
> https://wiki.samba.org/index.php/CTDB_and_Clustered_Samba) and I can't seem
> to get the current setup in a totally healthy state. If there is no
> recovery lock enabled, then the two nodes think they are healthy and the
> other is dead, and if there is a recovery lock enabled, one node seizes it
> and never releases it.
>
> I'm running Debian Buster, with Lustre+ZFS as the clustered file system for
> the recovery lock, and using upstream's Samba and CTDB. The lustre clients
> are mounted with flock, and I've confirmed the lock is being held with
> `lslocks`.
>
> Here is the debug information from the "healthy node" (the unhealthy one
> just says it can't take the lock as it is under contention, so I thought it
> would be of little use):
> https://drive.google.com/drive/folders/1yzFdwGKfDAHiVbsj7pcOe-jX4KHpFO2U?usp=sharing

Your main problem seems to be that the nodes are not connecting to each other:

  $ ls ctdb*.log
  ctdb-full.log  ctdb-locked-out-node.log  ctdb.log
  $ grep "connected to" ctdb*.log
  $

You should see lines similar to this:

  2020/05/07 16:58:06.186548 node.2 ctdbd[1351213]: 127.0.0.3:4379: connected to 127.0.0.2:4379 - 2 connected

Lack of connectivity would also explain the 2nd node trying to take the recovery lock. It will elect itself leader of its own cluster and try to take the lock. Luckily, the nodes are communicating at the filesystem level, so the lock can't be taken and the 2nd node cannot proceed with database recovery... so split brain is avoided.

Is there a firewall blocking TCP port 4379?

peace & happiness,
martin
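A quick way to confirm whether the nodes can reach each other on the CTDB transport is to run something like the following on each node (a sketch, assuming the default port 4379 and hypothetical private addresses 10.0.0.1 and 10.0.0.2 from /etc/ctdb/nodes):

  $ sudo ctdb status            # both nodes listed, neither marked DISCONNECTED
  $ sudo ctdb ping -n all       # ping every ctdbd listed in the nodes file
  $ ss -tln | grep 4379         # ctdbd listening on this node's private address
  $ nc -vz 10.0.0.2 4379        # from node A: is node B's ctdbd port reachable?

If the last check fails, a firewall rule or a routing problem on the private network between the nodes is the usual culprit.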
Christian Kuntz
2020-May-07 18:57 UTC
[Samba] Nodes in CTDB Cluster don't release recovery lock
Hello all,

Thanks for the input. I opened up the firewall ports and tested connectivity with ctdb's ping, to no avail. I did, however, fix the problem. I must have missed the section of the guide outlining the importance of the nodes file, but it seems the issue was that machine A's nodes file was in reverse order compared to B's. After rectifying that, the cluster came up without issue and the problem has been resolved.

Thanks everyone for your time!
Christian

On Thu, May 7, 2020 at 12:06 AM Martin Schwenke <martin at meltin.net> wrote:

> Your main problem seems to be that the nodes are not connecting to each
> other:
> [...]
> Is there a firewall blocking TCP port 4379?
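For anyone who hits the same symptom: /etc/ctdb/nodes must list the private addresses of all nodes in exactly the same order on every node, because CTDB derives each node's number (PNN) from its position in that file. A sketch with hypothetical addresses:

  # /etc/ctdb/nodes -- must be identical on every node
  10.0.0.1
  10.0.0.2

If one machine has the lines reversed, the two daemons disagree about which PNN belongs to which address and never form a single cluster, which matches the behaviour described above. After correcting the file, ctdbd needs to be restarted on the affected nodes so it re-reads the list.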