Martin Schwenke
2023-Nov-28 02:22 UTC
[Samba] CTDB: some problems about disconnecting the private network of ctdb leader nodes
Hi,

On Sun, 26 Nov 2023 21:13:21 +0800, "tu.qiuping via samba" <samba at lists.samba.org> wrote:

> My ctdb version is 4.17.7
>
> My ctdb cluster configuration is correct and the cluster is healthy before the operation.
>
> My cluster has three nodes, namely host-192-168-34-164, host-192-168-34-165, and host-192-168-34-166. The node host-192-168-34-164 is the leader before the operation.
>
> I conducted network oscillation testing on node host-192-168-34-164: I brought down the CTDB private network interface at 19:18:54.091439. This node then starts to do recovery. What puzzles me is that at 19:18:59.822903 this node timed out obtaining a lock, the log shows "Time out getting recovery lock, allowing recovery mode set anyway", and host-192-168-34-164 then took over all the virtual IPs.
>
> I checked the ctdb source code and found that lines 578 to 582 of the file samba/ctdb/server/ctdb_recover.c state: "Timeout. Consider this a success, not a failure, as we failed to set the recovery lock which is what we wanted. This can be caused by the cluster filesystem being very slow to arbitrate locks immediately after a node failure."
>
> I am puzzled why a reclock timeout is considered successful. Although a slow cluster filesystem may cause the reclock attempt to time out, disconnecting the private network of the leader node can also cause this situation. The disconnected node will then take over all the virtual IPs, which will conflict with the virtual IPs of the other, healthy nodes. So, is it inappropriate to treat a reclock timeout as success in this situation?

It is considered successful because it wasn't able to take the lock in a conflicting way.

This part of the code doesn't have anything to do with selecting the leader node. It is a sanity check at the end of recovery to ensure that the cluster lock can't be taken by nodes/processes that should not be able to take it. Given the way this lock is now used to determine the leader, this check is definitely in the wrong place, since this is a cluster management issue, not a recovery issue.

If the leader becomes disconnected then the cluster lock will need to be released (e.g. by the underlying filesystem) before another node can be leader. Two nodes should never be leader at the same time.

> The logs of the three nodes are attached.

I think mailman might have stripped the logs... :-(

I think I need to see the logs before I can say anything very useful...

peace & happiness,
martin
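[Editor's note: to make the shape of the sanity check discussed above concrete, here is a minimal, self-contained C sketch. It is an illustration only, not the actual samba/ctdb/server/ctdb_recover.c code; the function names and the 5-second timeout are invented. The point it shows is the one Martin makes: at the end of recovery the node tries to take a cluster lock it should *not* be able to get, so actually getting it is the error, while a timeout from the backend is passed through as success.]

/* Illustrative sketch only -- NOT the real ctdb_recover.c code. */
#include <stdio.h>

enum lock_result {
    LOCK_TAKEN,      /* we got the lock -- nobody else held it */
    LOCK_CONTENDED,  /* somebody else holds it -- the expected case */
    LOCK_TIMEOUT     /* backend too slow to answer, e.g. a cluster
                        filesystem arbitrating locks right after a
                        node failure */
};

/* Stand-in for running a cluster mutex helper with a timeout. */
static enum lock_result try_take_cluster_lock(int timeout_secs)
{
    (void)timeout_secs;
    return LOCK_TIMEOUT;   /* pretend the backend did not answer in time */
}

/* Returns 0 if it is OK to leave recovery mode, -1 if not. */
static int check_cluster_lock_sanity(void)
{
    switch (try_take_cluster_lock(5)) {
    case LOCK_TAKEN:
        /* The only failure: the lock could be taken by a node that
         * should not be able to take it. */
        fprintf(stderr, "ERROR: cluster lock was not held!\n");
        return -1;
    case LOCK_CONTENDED:
        return 0;          /* lock correctly held elsewhere */
    case LOCK_TIMEOUT:
        /* Treated as success, mirroring the log message
         * "Time out getting recovery lock, allowing recmode set anyway". */
        printf("Timeout getting cluster lock, allowing recmode set anyway\n");
        return 0;
    }
    return -1;
}

int main(void)
{
    return check_cluster_lock_sanity() == 0 ? 0 : 1;
}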
tu.qiuping
2023-Nov-28 15:18 UTC
[Samba] Re: CTDB: some problems about disconnecting the private network of ctdb leader nodes
Thank you for your reply, Martin.

You said: "If the leader becomes disconnected then the cluster lock will need to be released (e.g. by the underlying filesystem) before another node can be leader. Two nodes should never be leader at the same time." So what I want to ask is: how does the leader node know it has a problem when the network is abnormal? From the log, it believes the other nodes are the ones with network problems and takes over all the virtual IPs. Or is cluster management simply incomplete in this situation?

I have another question: why is there no log of a reclock renewal failure for such a long time after disconnecting the network of the leader node, given that the lock duration I set is 2 seconds? What I think should happen is that, once the leader node's network is abnormal, the reclock renewal fails and the node keeps retrying, instead of first taking over all the virtual IPs (a rough sketch of this behaviour follows the logs below).

Here are the logs for the three nodes:

host-192-168-34-164:

2023-11-23T19:18:54.091287+08:00 host-192-168-34-164 ctdbd[6708]: dead count reached for node 1
2023-11-23T19:18:54.091331+08:00 host-192-168-34-164 ctdbd[6708]: Tearing down connection to dead node :1
2023-11-23T19:18:54.091378+08:00 host-192-168-34-164 ctdbd[6708]: 100.100.34.164:4379: node 100.100.34.165:4379 is dead: 1 connected
2023-11-23T19:18:54.091411+08:00 host-192-168-34-164 ctdbd[6708]: dead count reached for node 2
2023-11-23T19:18:54.091416+08:00 host-192-168-34-164 ctdbd[6708]: Tearing down connection to dead node :2
2023-11-23T19:18:54.091439+08:00 host-192-168-34-164 ctdbd[6708]: 100.100.34.164:4379: node 100.100.34.166:4379 is dead: 0 connected
2023-11-23T19:18:54.091431+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091460+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091475+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.091491+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091498+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091504+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.091523+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async wait failed - fail_count=2
2023-11-23T19:18:54.091532+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.091538+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities.
2023-11-23T19:18:54.091543+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:54.093043+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:2782 The vnnmap count is different from the number of active lmaster nodes: 3 vs 1
2023-11-23T19:18:54.093065+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1314 Starting do_recovery
2023-11-23T19:18:54.093070+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1358 Recovery initiated due to problem with node 0
2023-11-23T19:18:54.093108+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1388 Recovery - updated flags
2023-11-23T19:18:54.127260+08:00 host-192-168-34-164 ctdbd[6708]: Recovery mode set to ACTIVE
2023-11-23T19:18:54.127318+08:00 host-192-168-34-164 ctdb-recovery[3208978]: Set recovery mode to ACTIVE
2023-11-23T19:18:54.127390+08:00 host-192-168-34-164 ctdbd[6708]: Recovery has started
2023-11-23T19:18:54.175221+08:00 host-192-168-34-164 ctdb-recovery[3208978]: start_recovery event finished
2023-11-23T19:18:54.175289+08:00 host-192-168-34-164 ctdb-recovery[3208978]: updated VNNMAP
2023-11-23T19:18:54.175302+08:00 host-192-168-34-164 ctdb-recovery[3208978]: recover database 0x6645c6c4
2023-11-23T19:18:54.175369+08:00 host-192-168-34-164 ctdbd[6708]: Freeze db: ctdb.tdb
2023-11-23T19:18:54.820651+08:00 host-192-168-34-164 ctdbd[6708]: Thaw db: ctdb.tdb generation 1832634741
2023-11-23T19:18:54.820686+08:00 host-192-168-34-164 ctdbd[6708]: Release freeze handle for db ctdb.tdb
2023-11-23T19:18:54.820705+08:00 host-192-168-34-164 ctdbd[6708]: Resent calls for database=ctdb.tdb, generation=1832634741, count=0
2023-11-23T19:18:54.820908+08:00 host-192-168-34-164 ctdb-recovery[3208978]: 1 of 1 databases recovered
2023-11-23T19:18:54.820952+08:00 host-192-168-34-164 ctdbd[6708]: Recovery mode set to NORMAL
2023-11-23T19:18:59.822903+08:00 host-192-168-34-164 ctdbd[6708]: ../../ctdb/server/ctdb_recover.c:585 Time out getting recovery lock, allowing recmode set anyway
2023-11-23T19:18:59.823217+08:00 host-192-168-34-164 ctdb-recovery[3208978]: Set recovery mode to NORMAL
2023-11-23T19:18:59.823441+08:00 host-192-168-34-164 ctdbd[6708]: Recovery has finished
2023-11-23T19:18:59.863137+08:00 host-192-168-34-164 ctdb-recovery[3208978]: recovered event finished
2023-11-23T19:18:59.863235+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:18:59.876831+08:00 host-192-168-34-164 ctdbd[6708]: Takeover of IP 101.101.34.172/16 on interface bond0.909
2023-11-23T19:18:59.877065+08:00 host-192-168-34-164 ctdbd[6708]: Takeover of IP 101.101.34.171/16 on interface bond0.909
2023-11-23T19:18:59.960718+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.172 before take ip
2023-11-23T19:18:59.960756+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.172 before take ip success
2023-11-23T19:18:59.960764+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ganesha-101.101.34.172
2023-11-23T19:19:00.030334+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:00.030369+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: container ganesha-101.101.34.172 is exist, restart it
2023-11-23T19:19:00.030375+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:00.137205+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.171 before take ip
2023-11-23T19:19:00.137245+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.171 before take ip success
2023-11-23T19:19:00.137254+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ganesha-101.101.34.171
2023-11-23T19:19:00.172884+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:00.172915+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: container ganesha-101.101.34.171 is exist, restart it
2023-11-23T19:19:00.172921+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:00.231998+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:00.237079+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run completed successfully
2023-11-23T19:19:00.237119+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1406 Recovery complete
2023-11-23T19:19:00.237126+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Resetting ban count to 0 for all nodes
2023-11-23T19:19:00.237132+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Just finished a recovery. New recoveries will now be suppressed for the rerecovery timeout (10 seconds)
2023-11-23T19:19:00.237138+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Disabling recoveries for 10 seconds
2023-11-23T19:19:01.319639+08:00 host-192-168-34-164 ctdbd[6708]: iface[bond0.909] has changed it's link status up => down
2023-11-23T19:19:01.320721+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ERROR: public network interface bond0 is down
2023-11-23T19:19:01.320756+08:00 host-192-168-34-164 ctdb-eventd[6746]: monitor event failed
2023-11-23T19:19:01.320794+08:00 host-192-168-34-164 ctdbd[6708]: monitor event failed - disabling node
2023-11-23T19:19:01.321211+08:00 host-192-168-34-164 ctdbd[6708]: Node became UNHEALTHY. Ask recovery master to reallocate IPs
2023-11-23T19:19:02.240905+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Interface bond0.909 changed state: 1 => 0
2023-11-23T19:19:02.241879+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Trigger takeoverrun
2023-11-23T19:19:02.241945+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:19:02.247568+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.172
2023-11-23T19:19:02.247599+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.171
2023-11-23T19:19:02.247604+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.170
2023-11-23T19:19:02.247977+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.172/16 on interface bond0.909 node:-1
2023-11-23T19:19:02.248325+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.171/16 on interface bond0.909 node:-1
2023-11-23T19:19:02.248632+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.170/16 on interface bond0.909 node:-1
2023-11-23T19:19:02.264822+08:00 host-192-168-34-164 ctdbd[6708]: Starting traverse on DB ctdb.tdb (id 4014103)
2023-11-23T19:19:02.265666+08:00 host-192-168-34-164 ctdbd[6708]: Ending traverse on DB ctdb.tdb (id 4014103), records 0
2023-11-23T19:19:06.321754+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:10.238115+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Reenabling recoveries after timeout
2023-11-23T19:19:11.322500+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:12.576955+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:12.576987+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ganesha-101.101.34.172
2023-11-23T19:19:12.576993+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: stop ganesha-101.101.34.172 success
2023-11-23T19:19:12.576997+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip end ************
2023-11-23T19:19:16.323713+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:21.324677+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:22.863158+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:22.863196+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ganesha-101.101.34.171
2023-11-23T19:19:22.863202+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: stop ganesha-101.101.34.171 success
2023-11-23T19:19:22.863206+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip end ************
2023-11-23T19:19:26.325402+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:29.254184+08:00 host-192-168-34-164 ctdb-takeover[3213011]: RELEASE_IP 101.101.34.170 failed on node 0, ret=110
2023-11-23T19:19:29.254226+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Assigning banning credits to node 0
2023-11-23T19:19:29.254270+08:00 host-192-168-34-164 ctdb-takeover[3213011]: takeover run failed, ret=110
2023-11-23T19:19:29.254429+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_takeover_run() failed
2023-11-23T19:19:29.254461+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run unsuccessful
2023-11-23T19:19:29.255705+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Assigned IP 101.101.34.170 not on an interface
2023-11-23T19:19:29.255725+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Trigger takeoverrun
2023-11-23T19:19:29.255937+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:19:29.262165+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.172
2023-11-23T19:19:29.262213+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.171
2023-11-23T19:19:29.262220+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.170
2023-11-23T19:19:31.326780+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:32.249734+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:32.250519+08:00 host-192-168-34-164 ctdb-eventd[6746]: releaseip event timed out
2023-11-23T19:19:32.251108+08:00 host-192-168-34-164 ctdbd[6708]: Ignoring hung script for releaseip event
2023-11-23T19:19:32.251664+08:00 host-192-168-34-164 ctdbd[6708]: pnn 0 Invalid reqid 4014083 in ctdb_reply_control
2023-11-23T19:19:32.283931+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:32.288809+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run completed successfully
2023-11-23T19:19:34.316783+08:00 host-192-168-34-164 ctdbd[6708]: Starting traverse on DB ctdb.tdb (id 4014184)
2023-11-23T19:19:34.317661+08:00 host-192-168-34-164 ctdbd[6708]: Ending traverse on DB ctdb.tdb (id 4014184), records 0
2023-11-23T19:19:36.348451+08:00 host-192-168-34-164 ctdb-eventd[6746]: 05.system: WARNING: System memory utilization 85% >= threshold 80%
2023-11-23T19:19:36.378082+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ERROR: public network interface bond0 is down
2023-11-23T19:19:36.378118+08:00 host-192-168-34-164 ctdb-eventd[6746]: monitor event failed
......

host-192-168-34-165:

2023-11-23T16:53:01.979285+08:00 host-192-168-34-165 ctdb-eventd[6610]: 05.system: NOTICE: System memory utilization 79% < threshold 80%
2023-11-23T16:53:12.671558+08:00 host-192-168-34-165 ctdb-eventd[6610]: 05.system: WARNING: System memory utilization 80% >= threshold 80%
2023-11-23T19:18:54.090491+08:00 host-192-168-34-165 ctdbd[6607]: dead count reached for node 0
2023-11-23T19:18:54.090547+08:00 host-192-168-34-165 ctdbd[6607]: Tearing down connection to dead node :0
2023-11-23T19:18:54.090598+08:00 host-192-168-34-165 ctdbd[6607]: 100.100.34.165:4379: node 100.100.34.164:4379 is dead: 1 connected
2023-11-23T19:18:54.090658+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090696+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090706+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.090714+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Async wait failed - fail_count=1
2023-11-23T19:18:54.090720+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.090727+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-11-23T19:18:54.090733+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:56.311737+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Leader broadcast timeout
2023-11-23T19:18:56.311782+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Start election
2023-11-23T19:18:56.311874+08:00 host-192-168-34-165 ctdbd[6607]: Recovery mode set to ACTIVE
2023-11-23T19:18:56.312221+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:18:58.313719+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Unable to take cluster lock - timeout
2023-11-23T19:19:01.313924+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Leader broadcast timeout
2023-11-23T19:19:01.313970+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Start election
2023-11-23T19:19:01.314313+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:19:01.425277+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Cluster lock taken successfully
2023-11-23T19:19:01.425323+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Took cluster lock, leader=1
2023-11-23T19:19:01.426544+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Node:1 was in recovery mode. Start recovery process
2023-11-23T19:19:01.426591+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Node:2 was in recovery mode. Start recovery process
2023-11-23T19:19:01.426600+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1314 Starting do_recovery
2023-11-23T19:19:01.426606+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1358 Recovery initiated due to problem with node 0
2023-11-23T19:19:01.426968+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1388 Recovery - updated flags
2023-11-23T19:19:01.427006+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Set recovery_helper to "/usr/libexec/ctdb/ctdb_recovery_helper"
2023-11-23T19:19:01.436697+08:00 host-192-168-34-165 ctdb-recovery[1322606]: Set recovery mode to ACTIVE
2023-11-23T19:19:01.436760+08:00 host-192-168-34-165 ctdbd[6607]: Recovery has started
2023-11-23T19:19:01.491485+08:00 host-192-168-34-165 ctdb-recovery[1322606]: start_recovery event finished
2023-11-23T19:19:01.491851+08:00 host-192-168-34-165 ctdb-recovery[1322606]: updated VNNMAP
2023-11-23T19:19:01.491888+08:00 host-192-168-34-165 ctdb-recovery[1322606]: recover database 0x6645c6c4
2023-11-23T19:19:01.492199+08:00 host-192-168-34-165 ctdbd[6607]: Freeze db: ctdb.tdb
2023-11-23T19:19:01.506697+08:00 host-192-168-34-165 ctdbd[6607]: Thaw db: ctdb.tdb generation 996306493
2023-11-23T19:19:01.506719+08:00 host-192-168-34-165 ctdbd[6607]: Release freeze handle for db ctdb.tdb
2023-11-23T19:19:01.506736+08:00 host-192-168-34-165 ctdbd[6607]: Resent calls for database=ctdb.tdb, generation=996306493, count=0
2023-11-23T19:19:01.506985+08:00 host-192-168-34-165 ctdb-recovery[1322606]: 1 of 1 databases recovered
2023-11-23T19:19:01.507034+08:00 host-192-168-34-165 ctdbd[6607]: Recovery mode set to NORMAL
2023-11-23T19:19:01.615271+08:00 host-192-168-34-165 ctdb-recovery[1322606]: Set recovery mode to NORMAL
2023-11-23T19:19:01.615383+08:00 host-192-168-34-165 ctdbd[6607]: Recovery has finished
2023-11-23T19:19:01.668818+08:00 host-192-168-34-165 ctdb-recovery[1322606]: recovered event finished
2023-11-23T19:19:01.668895+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Takeover run starting
2023-11-23T19:19:01.668963+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Set takeover_helper to "/usr/libexec/ctdb/ctdb_takeover_helper"
2023-11-23T19:19:01.854276+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Takeover run completed successfully
2023-11-23T19:19:01.854317+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1406 Recovery complete
2023-11-23T19:19:01.854325+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Resetting ban count to 0 for all nodes
2023-11-23T19:19:01.854331+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Just finished a recovery. New recoveries will now be suppressed for the rerecovery timeout (10 seconds)
2023-11-23T19:19:01.854341+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Disabling recoveries for 10 seconds
2023-11-23T19:19:11.855338+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Reenabling recoveries after timeout

host-192-168-34-166:

2023-11-23T19:18:54.090703+08:00 host-192-168-34-166 ctdbd[6611]: dead count reached for node 0
2023-11-23T19:18:54.090747+08:00 host-192-168-34-166 ctdbd[6611]: Tearing down connection to dead node :0
2023-11-23T19:18:54.090789+08:00 host-192-168-34-166 ctdbd[6611]: 100.100.34.166:4379: node 100.100.34.164:4379 is dead: 1 connected
2023-11-23T19:18:54.090935+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090962+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090970+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.090975+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Async wait failed - fail_count=1
2023-11-23T19:18:54.090980+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.090985+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-11-23T19:18:54.090989+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:56.312040+08:00 host-192-168-34-166 ctdbd[6611]: Recovery mode set to ACTIVE
2023-11-23T19:18:56.312591+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:18:58.314499+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Unable to take cluster lock - timeout
2023-11-23T19:19:01.436922+08:00 host-192-168-34-166 ctdbd[6611]: Recovery has started
2023-11-23T19:19:01.492286+08:00 host-192-168-34-166 ctdbd[6611]: Freeze db: ctdb.tdb
2023-11-23T19:19:01.506836+08:00 host-192-168-34-166 ctdbd[6611]: Thaw db: ctdb.tdb generation 996306493
2023-11-23T19:19:01.506859+08:00 host-192-168-34-166 ctdbd[6611]: Release freeze handle for db ctdb.tdb
2023-11-23T19:19:01.506876+08:00 host-192-168-34-166 ctdbd[6611]: Resent calls for database=ctdb.tdb, generation=996306493, count=0
2023-11-23T19:19:01.507116+08:00 host-192-168-34-166 ctdbd[6611]: Recovery mode set to NORMAL
2023-11-23T19:19:01.615475+08:00 host-192-168-34-166 ctdbd[6611]: Recovery has finished
2023-11-23T19:19:01.669217+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Disabling takeover runs for 60 seconds
2023-11-23T19:19:01.681095+08:00 host-192-168-34-166 ctdbd[6611]: Takeover of IP 101.101.34.170/16 on interface bond0.909
2023-11-23T19:19:01.749681+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: stop ganesha-101.101.34.170 before take ip
2023-11-23T19:19:01.749709+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: stop ganesha-101.101.34.170 before take ip success
2023-11-23T19:19:01.749716+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: ganesha-101.101.34.170
2023-11-23T19:19:01.788451+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:01.788480+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: container ganesha-101.101.34.170 is exist, restart it
2023-11-23T19:19:01.788485+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:01.839082+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Received leader broadcast, leader=1
2023-11-23T19:19:01.848160+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:01.854402+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Reenabling takeover runs
2023-11-23T19:19:03.882907+08:00 host-192-168-34-166 ctdbd[6611]: Starting traverse on DB ctdb.tdb (id 324774)
2023-11-23T19:19:03.883833+08:00 host-192-168-34-166 ctdbd[6611]: Ending traverse on DB ctdb.tdb (id 324774), records 0
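[Editor's note: the behaviour argued for above (a leader that cannot renew its cluster lock should step down rather than take over more IPs) can be sketched as a small C program. This is purely illustrative of the suggestion in this mail, not of what ctdb-recoverd actually does; all function names here (confirm_cluster_lock, release_public_ips, run_leader_duties) are hypothetical.]

/* Illustrative sketch only -- NOT ctdb-recoverd code. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for talking to the lock backend and to the IP layer. */
static bool confirm_cluster_lock(void)  { return false; /* pretend renewal fails */ }
static void release_public_ips(void)    { puts("releasing all public IPs"); }
static void run_leader_duties(void)     { puts("leader: takeover runs allowed"); }

int main(void)
{
    const unsigned renew_interval_secs = 2;   /* matches the 2-second lock duration mentioned above */
    bool leader = true;

    while (leader) {
        if (!confirm_cluster_lock()) {
            /* Lost (or cannot verify) the lock: step down instead of
             * continuing to assign public IPs to ourselves. */
            fprintf(stderr, "cluster lock renewal failed, stepping down\n");
            release_public_ips();
            leader = false;
            break;
        }
        run_leader_duties();
        sleep(renew_interval_secs);
    }
    return 0;
}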