Christian Kuntz
2020-May-06  20:16 UTC
[Samba] Nodes in CTDB Cluster don't release recovery lock
Hello all,

First of all, apologies if this isn't the right location for this question. I couldn't find a CTDB-specific mailing list or IRC channel, so I figured the general one would be appropriate. Please let me know if this question is better placed elsewhere.

I'm trying to test clustered Samba and have a two-node CTDB setup (following the guide here: https://wiki.samba.org/index.php/CTDB_and_Clustered_Samba), and I can't seem to get the current setup into a totally healthy state. If no recovery lock is enabled, each node thinks it is healthy and the other is dead. If a recovery lock is enabled, one node seizes it and never releases it.

I'm running Debian Buster, with Lustre+ZFS as the clustered file system for the recovery lock, and using upstream's Samba and CTDB. The Lustre clients are mounted with flock, and I've confirmed the lock is being held with `lslocks`.

Here is the debug information from the "healthy node" (the unhealthy one just says it can't take the lock as it is under contention, so I thought it would be of little use): https://drive.google.com/drive/folders/1yzFdwGKfDAHiVbsj7pcOe-jX4KHpFO2U?usp=sharing

Best regards and thank you all for your time,
Christian
Martin Schwenke
2020-May-07  07:05 UTC
[Samba] Nodes in CTDB Cluster don't release recovery lock
Hi Christian,

On Wed, 6 May 2020 13:16:36 -0700, Christian Kuntz via samba <samba at lists.samba.org> wrote:

> First of all, apologies if this isn't the right location for this question,
> I couldn't find a CTDB specific mailing list or IRC so I figured the
> general one would be appropriate. Please let me know if this question is
> better placed elsewhere.

This is the right place. Thanks for asking! :-)

> I'm trying to test clustered samba and have a two node CTDB setup
> (Following the guide here:
> https://wiki.samba.org/index.php/CTDB_and_Clustered_Samba) and I can't seem
> to get the current setup in a totally healthy state. If there is no
> recovery lock enabled, then the two nodes think they are healthy and the
> other is dead, and if there is a recovery lock enabled, one node seizes it
> and never releases it.
>
> I'm running Debian Buster, with Lustre+ZFS as the clustered file system for
> the recovery lock, and using upstream's Samba and CTDB. The lustre clients
> are mounted with flock, and I've confirmed the lock is being held with
> `lslocks`.
>
> Here is the debug information from the "healthy node" (the unhealthy one
> just says it can't take the lock as it is under contention, so I thought it
> would be of little use):
> https://drive.google.com/drive/folders/1yzFdwGKfDAHiVbsj7pcOe-jX4KHpFO2U?usp=sharing

Your main problem seems to be that the nodes are not connecting to each other:

  $ ls ctdb*.log
  ctdb-full.log  ctdb-locked-out-node.log  ctdb.log
  $ grep "connected to" ctdb*.log
  $

You should see lines similar to this:

  2020/05/07 16:58:06.186548 node.2 ctdbd[1351213]: 127.0.0.3:4379: connected to 127.0.0.2:4379 - 2 connected

Lack of connectivity would also explain the 2nd node trying to take the recovery lock. It will elect itself leader of its own cluster and try to take the lock. Luckily, the nodes are communicating at the filesystem level, so the lock can't be taken and the 2nd node cannot proceed with database recovery... so split brain is avoided.

Is there a firewall blocking TCP port 4379?

peace & happiness,
martin
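
The two checks suggested above (looking for "connected to" lines in the logs, and testing whether TCP port 4379 is reachable) could be scripted roughly as follows. This is only a minimal sketch: the function names `connection_lines` and `port_reachable` are made up for illustration, and the log-matching is a plain substring test based on the single example line shown above, not on any documented CTDB log format.

```python
import socket

CTDB_PORT = 4379  # TCP port mentioned in the message above


def connection_lines(log_text):
    """Return the log lines that contain the phrase "connected to".

    Assumption: a simple substring match, mirroring the grep above.
    """
    return [line for line in log_text.splitlines() if "connected to" in line]


def port_reachable(host, port=CTDB_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

An empty result from `connection_lines` on a node's log corresponds to the empty grep output shown above. A `True` from `port_reachable` only shows that something accepts TCP connections on that port; it does not prove the CTDB nodes recognise each other.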
Christian Kuntz
2020-May-07  18:57 UTC
[Samba] Nodes in CTDB Cluster don't release recovery lock
Hello all,

Thanks for the input. I opened up the firewall ports and tested connectivity with ctdb's ping, to no avail. I did, however, fix the problem. I must have missed the section of the guide outlining the importance of the nodes file, but it seems the issue was that machine A's nodes file was in reverse order compared to B's. After rectifying that, the cluster came up without issue.

Thanks everyone for your time!

Christian
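
Since the fix turned out to be a nodes file whose entries were in a different order on each machine, a quick comparison of the two files might have caught it earlier. The sketch below assumes the contents of each node's file have already been read into strings; the helper names `parse_nodes` and `first_mismatch` and the addresses in the usage example are hypothetical. Lines are compared as written (only surrounding whitespace and blank lines are ignored), on the assumption that position in the file matters.

```python
from itertools import zip_longest


def parse_nodes(text):
    """Return the non-blank lines of a nodes file, in order."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def first_mismatch(a_text, b_text):
    """Return (index, entry_a, entry_b) for the first differing position.

    Returns None when both files list the same entries in the same order.
    A missing entry on either side is reported as None.
    """
    pairs = zip_longest(parse_nodes(a_text), parse_nodes(b_text))
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return index, a, b
    return None


# Usage with made-up addresses: the same two entries, in reverse order.
machine_a = "10.0.0.1\n10.0.0.2\n"
machine_b = "10.0.0.2\n10.0.0.1\n"
print(first_mismatch(machine_a, machine_b))
```

Comparing ordered lists rather than sets is deliberate here: two files with the same addresses in a different order are exactly the case described above, and a set comparison would report them as equal.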