Martin Schwenke
2023-Nov-28 02:22 UTC
[Samba] CTDB: some problems about disconnecting the private network of ctdb leader nodes
Hi,

On Sun, 26 Nov 2023 21:13:21 +0800, "tu.qiuping via samba" <samba at lists.samba.org> wrote:

> My ctdb version is 4.17.7
>
> My ctdb cluster configuration is correct and the cluster is healthy before the operation.
>
> My cluster has three nodes, namely host-192-168-34-164, host-192-168-34-165, and host-192-168-34-166. The node host-192-168-34-164 is the leader before the operation.
>
> I conducted network oscillation testing on node host-192-168-34-164: I brought down the CTDB private network interface at 19:18:54.091439. This node then starts to do recovery. What puzzles me is that at 19:18:59.822903 this node timed out obtaining a lock, the log shows "Time out getting recovery lock, allowing recovery mode set anyway", and host-192-168-34-164 then took over all the virtual IPs.
>
> I checked the ctdb source code and found that lines 578 to 582 of the file samba/ctdb/server/ctdb_recover.c state: "Timeout. Consider this a success, not a failure, as we failed to set the recovery lock which is what we wanted. This can be caused by the cluster filesystem being very slow to arbitrate locks immediately after a node failure."
>
> I am puzzled why a reclock timeout is considered successful. Although a slow cluster filesystem may cause the reclock attempt to time out, disconnecting the private network of the leader node can also cause this situation. The disconnected node will then take over all the virtual IPs, which will conflict with the virtual IPs of the other, healthy nodes. So, is it inappropriate to treat a reclock timeout as success in this situation?

It is considered successful because it wasn't able to take the lock in a conflicting way.

This part of the code doesn't have anything to do with selecting the leader node. It is a sanity check at the end of recovery to ensure that the cluster lock can't be taken by nodes/processes that should not be able to take it. Given the way this lock is now used to determine the leader, this check is definitely in the wrong place, since this is a cluster management issue, not a recovery issue.

If the leader becomes disconnected then the cluster lock will need to be released (e.g. by the underlying filesystem) before another node can be leader. Two nodes should never be leader at the same time.

> The logs of the three nodes are attached.

I think mailman might have stripped the logs... :-(

I think I need to see the logs before I can say anything very useful...

peace & happiness,
martin
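[Editor's note: to make the shape of the sanity check discussed above concrete, here is a minimal, self-contained C sketch. It is an illustration only, not the actual samba/ctdb/server/ctdb_recover.c code; the function names and the 5-second timeout are invented. The point it shows is the one Martin makes: at the end of recovery the node tries to take a cluster lock it should *not* be able to get, so actually getting it is the error, while a timeout from the backend is passed through as success.]

/* Illustrative sketch only -- NOT the real ctdb_recover.c code. */
#include <stdio.h>

enum lock_result {
    LOCK_TAKEN,      /* we got the lock -- nobody else held it */
    LOCK_CONTENDED,  /* somebody else holds it -- the expected case */
    LOCK_TIMEOUT     /* backend too slow to answer, e.g. a cluster
                        filesystem arbitrating locks right after a
                        node failure */
};

/* Stand-in for running a cluster mutex helper with a timeout. */
static enum lock_result try_take_cluster_lock(int timeout_secs)
{
    (void)timeout_secs;
    return LOCK_TIMEOUT;   /* pretend the backend did not answer in time */
}

/* Returns 0 if it is OK to leave recovery mode, -1 if not. */
static int check_cluster_lock_sanity(void)
{
    switch (try_take_cluster_lock(5)) {
    case LOCK_TAKEN:
        /* The only failure: the lock could be taken by a node that
         * should not be able to take it. */
        fprintf(stderr, "ERROR: cluster lock was not held!\n");
        return -1;
    case LOCK_CONTENDED:
        return 0;          /* lock correctly held elsewhere */
    case LOCK_TIMEOUT:
        /* Treated as success, mirroring the log message
         * "Time out getting recovery lock, allowing recmode set anyway". */
        printf("Timeout getting cluster lock, allowing recmode set anyway\n");
        return 0;
    }
    return -1;
}

int main(void)
{
    return check_cluster_lock_sanity() == 0 ? 0 : 1;
}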
tu.qiuping
2023-Nov-28 15:18 UTC
[Samba] Re: CTDB: some problems about disconnecting the private network of ctdb leader nodes
Thank you for your reply, Martin.

You said: "If the leader becomes disconnected then the cluster lock will need to be released (e.g. by the underlying filesystem) before another node can be leader. Two nodes should never be leader at the same time." So what I want to ask is: how does the leader node know it has a problem when the network is abnormal? From the log, it believes the other nodes are the ones with network problems and takes over all the virtual IPs. Or is cluster management simply incomplete in this situation?

I have another question: why is there no log of a reclock renewal failure for such a long time after disconnecting the network of the leader node, given that the lock duration I set is 2 seconds? What I think should happen is that, once the leader node's network is abnormal, the reclock renewal fails and the node keeps retrying, instead of first taking over all the virtual IPs (a rough sketch of this behaviour follows the logs below).

Here are the logs for the three nodes:

host-192-168-34-164:

2023-11-23T19:18:54.091287+08:00 host-192-168-34-164 ctdbd[6708]: dead count reached for node 1
2023-11-23T19:18:54.091331+08:00 host-192-168-34-164 ctdbd[6708]: Tearing down connection to dead node :1
2023-11-23T19:18:54.091378+08:00 host-192-168-34-164 ctdbd[6708]: 100.100.34.164:4379: node 100.100.34.165:4379 is dead: 1 connected
2023-11-23T19:18:54.091411+08:00 host-192-168-34-164 ctdbd[6708]: dead count reached for node 2
2023-11-23T19:18:54.091416+08:00 host-192-168-34-164 ctdbd[6708]: Tearing down connection to dead node :2
2023-11-23T19:18:54.091439+08:00 host-192-168-34-164 ctdbd[6708]: 100.100.34.164:4379: node 100.100.34.166:4379 is dead: 0 connected
2023-11-23T19:18:54.091431+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091460+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091475+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.091491+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091498+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.091504+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.091523+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Async wait failed - fail_count=2
2023-11-23T19:18:54.091532+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.091538+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities.
2023-11-23T19:18:54.091543+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:54.093043+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:2782 The vnnmap count is different from the number of active lmaster nodes: 3 vs 1
2023-11-23T19:18:54.093065+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1314 Starting do_recovery
2023-11-23T19:18:54.093070+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1358 Recovery initiated due to problem with node 0
2023-11-23T19:18:54.093108+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1388 Recovery - updated flags
2023-11-23T19:18:54.127260+08:00 host-192-168-34-164 ctdbd[6708]: Recovery mode set to ACTIVE
2023-11-23T19:18:54.127318+08:00 host-192-168-34-164 ctdb-recovery[3208978]: Set recovery mode to ACTIVE
2023-11-23T19:18:54.127390+08:00 host-192-168-34-164 ctdbd[6708]: Recovery has started
2023-11-23T19:18:54.175221+08:00 host-192-168-34-164 ctdb-recovery[3208978]: start_recovery event finished
2023-11-23T19:18:54.175289+08:00 host-192-168-34-164 ctdb-recovery[3208978]: updated VNNMAP
2023-11-23T19:18:54.175302+08:00 host-192-168-34-164 ctdb-recovery[3208978]: recover database 0x6645c6c4
2023-11-23T19:18:54.175369+08:00 host-192-168-34-164 ctdbd[6708]: Freeze db: ctdb.tdb
2023-11-23T19:18:54.820651+08:00 host-192-168-34-164 ctdbd[6708]: Thaw db: ctdb.tdb generation 1832634741
2023-11-23T19:18:54.820686+08:00 host-192-168-34-164 ctdbd[6708]: Release freeze handle for db ctdb.tdb
2023-11-23T19:18:54.820705+08:00 host-192-168-34-164 ctdbd[6708]: Resent calls for database=ctdb.tdb, generation=1832634741, count=0
2023-11-23T19:18:54.820908+08:00 host-192-168-34-164 ctdb-recovery[3208978]: 1 of 1 databases recovered
2023-11-23T19:18:54.820952+08:00 host-192-168-34-164 ctdbd[6708]: Recovery mode set to NORMAL
2023-11-23T19:18:59.822903+08:00 host-192-168-34-164 ctdbd[6708]: ../../ctdb/server/ctdb_recover.c:585 Time out getting recovery lock, allowing recmode set anyway
2023-11-23T19:18:59.823217+08:00 host-192-168-34-164 ctdb-recovery[3208978]: Set recovery mode to NORMAL
2023-11-23T19:18:59.823441+08:00 host-192-168-34-164 ctdbd[6708]: Recovery has finished
2023-11-23T19:18:59.863137+08:00 host-192-168-34-164 ctdb-recovery[3208978]: recovered event finished
2023-11-23T19:18:59.863235+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:18:59.876831+08:00 host-192-168-34-164 ctdbd[6708]: Takeover of IP 101.101.34.172/16 on interface bond0.909
2023-11-23T19:18:59.877065+08:00 host-192-168-34-164 ctdbd[6708]: Takeover of IP 101.101.34.171/16 on interface bond0.909
2023-11-23T19:18:59.960718+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.172 before take ip
2023-11-23T19:18:59.960756+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.172 before take ip success
2023-11-23T19:18:59.960764+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ganesha-101.101.34.172
2023-11-23T19:19:00.030334+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:00.030369+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: container ganesha-101.101.34.172 is exist, restart it
2023-11-23T19:19:00.030375+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:00.137205+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.171 before take ip
2023-11-23T19:19:00.137245+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: stop ganesha-101.101.34.171 before take ip success
2023-11-23T19:19:00.137254+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ganesha-101.101.34.171
2023-11-23T19:19:00.172884+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:00.172915+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: container ganesha-101.101.34.171 is exist, restart it
2023-11-23T19:19:00.172921+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:00.231998+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:00.237079+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run completed successfully
2023-11-23T19:19:00.237119+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ../../ctdb/server/ctdb_recoverd.c:1406 Recovery complete
2023-11-23T19:19:00.237126+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Resetting ban count to 0 for all nodes
2023-11-23T19:19:00.237132+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Just finished a recovery. New recoveries will now be suppressed for the rerecovery timeout (10 seconds)
2023-11-23T19:19:00.237138+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Disabling recoveries for 10 seconds
2023-11-23T19:19:01.319639+08:00 host-192-168-34-164 ctdbd[6708]: iface[bond0.909] has changed it's link status up => down
2023-11-23T19:19:01.320721+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ERROR: public network interface bond0 is down
2023-11-23T19:19:01.320756+08:00 host-192-168-34-164 ctdb-eventd[6746]: monitor event failed
2023-11-23T19:19:01.320794+08:00 host-192-168-34-164 ctdbd[6708]: monitor event failed - disabling node
2023-11-23T19:19:01.321211+08:00 host-192-168-34-164 ctdbd[6708]: Node became UNHEALTHY. Ask recovery master to reallocate IPs
2023-11-23T19:19:02.240905+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Interface bond0.909 changed state: 1 => 0
2023-11-23T19:19:02.241879+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Trigger takeoverrun
2023-11-23T19:19:02.241945+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:19:02.247568+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.172
2023-11-23T19:19:02.247599+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.171
2023-11-23T19:19:02.247604+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Failed to find node to cover ip 101.101.34.170
2023-11-23T19:19:02.247977+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.172/16 on interface bond0.909 node:-1
2023-11-23T19:19:02.248325+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.171/16 on interface bond0.909 node:-1
2023-11-23T19:19:02.248632+08:00 host-192-168-34-164 ctdbd[6708]: Release of IP 101.101.34.170/16 on interface bond0.909 node:-1
2023-11-23T19:19:02.264822+08:00 host-192-168-34-164 ctdbd[6708]: Starting traverse on DB ctdb.tdb (id 4014103)
2023-11-23T19:19:02.265666+08:00 host-192-168-34-164 ctdbd[6708]: Ending traverse on DB ctdb.tdb (id 4014103), records 0
2023-11-23T19:19:06.321754+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:10.238115+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Reenabling recoveries after timeout
2023-11-23T19:19:11.322500+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:12.576955+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:12.576987+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ganesha-101.101.34.172
2023-11-23T19:19:12.576993+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: stop ganesha-101.101.34.172 success
2023-11-23T19:19:12.576997+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip end ************
2023-11-23T19:19:16.323713+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:21.324677+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:22.863158+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:22.863196+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ganesha-101.101.34.171
2023-11-23T19:19:22.863202+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: stop ganesha-101.101.34.171 success
2023-11-23T19:19:22.863206+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip end ************
2023-11-23T19:19:26.325402+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:29.254184+08:00 host-192-168-34-164 ctdb-takeover[3213011]: RELEASE_IP 101.101.34.170 failed on node 0, ret=110
2023-11-23T19:19:29.254226+08:00 host-192-168-34-164 ctdb-takeover[3213011]: Assigning banning credits to node 0
2023-11-23T19:19:29.254270+08:00 host-192-168-34-164 ctdb-takeover[3213011]: takeover run failed, ret=110
2023-11-23T19:19:29.254429+08:00 host-192-168-34-164 ctdb-recoverd[8356]: ctdb_takeover_run() failed
2023-11-23T19:19:29.254461+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run unsuccessful
2023-11-23T19:19:29.255705+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Assigned IP 101.101.34.170 not on an interface
2023-11-23T19:19:29.255725+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Trigger takeoverrun
2023-11-23T19:19:29.255937+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run starting
2023-11-23T19:19:29.262165+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.172
2023-11-23T19:19:29.262213+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.171
2023-11-23T19:19:29.262220+08:00 host-192-168-34-164 ctdb-takeover[3221874]: Failed to find node to cover ip 101.101.34.170
2023-11-23T19:19:31.326780+08:00 host-192-168-34-164 ctdbd[6708]: Monitoring event was cancelled
2023-11-23T19:19:32.249734+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: ********** this is nfs_releaseip start **********
2023-11-23T19:19:32.250519+08:00 host-192-168-34-164 ctdb-eventd[6746]: releaseip event timed out
2023-11-23T19:19:32.251108+08:00 host-192-168-34-164 ctdbd[6708]: Ignoring hung script for releaseip event
2023-11-23T19:19:32.251664+08:00 host-192-168-34-164 ctdbd[6708]: pnn 0 Invalid reqid 4014083 in ctdb_reply_control
2023-11-23T19:19:32.283931+08:00 host-192-168-34-164 ctdb-eventd[6746]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:32.288809+08:00 host-192-168-34-164 ctdb-recoverd[8356]: Takeover run completed successfully
2023-11-23T19:19:34.316783+08:00 host-192-168-34-164 ctdbd[6708]: Starting traverse on DB ctdb.tdb (id 4014184)
2023-11-23T19:19:34.317661+08:00 host-192-168-34-164 ctdbd[6708]: Ending traverse on DB ctdb.tdb (id 4014184), records 0
2023-11-23T19:19:36.348451+08:00 host-192-168-34-164 ctdb-eventd[6746]: 05.system: WARNING: System memory utilization 85% >= threshold 80%
2023-11-23T19:19:36.378082+08:00 host-192-168-34-164 ctdb-eventd[6746]: 10.interface: ERROR: public network interface bond0 is down
2023-11-23T19:19:36.378118+08:00 host-192-168-34-164 ctdb-eventd[6746]: monitor event failed
......

host-192-168-34-165:

2023-11-23T16:53:01.979285+08:00 host-192-168-34-165 ctdb-eventd[6610]: 05.system: NOTICE: System memory utilization 79% < threshold 80%
2023-11-23T16:53:12.671558+08:00 host-192-168-34-165 ctdb-eventd[6610]: 05.system: WARNING: System memory utilization 80% >= threshold 80%
2023-11-23T19:18:54.090491+08:00 host-192-168-34-165 ctdbd[6607]: dead count reached for node 0
2023-11-23T19:18:54.090547+08:00 host-192-168-34-165 ctdbd[6607]: Tearing down connection to dead node :0
2023-11-23T19:18:54.090598+08:00 host-192-168-34-165 ctdbd[6607]: 100.100.34.165:4379: node 100.100.34.164:4379 is dead: 1 connected
2023-11-23T19:18:54.090658+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090696+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090706+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.090714+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Async wait failed - fail_count=1
2023-11-23T19:18:54.090720+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.090727+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-11-23T19:18:54.090733+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:56.311737+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Leader broadcast timeout
2023-11-23T19:18:56.311782+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Start election
2023-11-23T19:18:56.311874+08:00 host-192-168-34-165 ctdbd[6607]: Recovery mode set to ACTIVE
2023-11-23T19:18:56.312221+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:18:58.313719+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Unable to take cluster lock - timeout
2023-11-23T19:19:01.313924+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Leader broadcast timeout
2023-11-23T19:19:01.313970+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Start election
2023-11-23T19:19:01.314313+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:19:01.425277+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Cluster lock taken successfully
2023-11-23T19:19:01.425323+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Took cluster lock, leader=1
2023-11-23T19:19:01.426544+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Node:1 was in recovery mode. Start recovery process
2023-11-23T19:19:01.426591+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Node:2 was in recovery mode. Start recovery process
2023-11-23T19:19:01.426600+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1314 Starting do_recovery
2023-11-23T19:19:01.426606+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1358 Recovery initiated due to problem with node 0
2023-11-23T19:19:01.426968+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1388 Recovery - updated flags
2023-11-23T19:19:01.427006+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Set recovery_helper to "/usr/libexec/ctdb/ctdb_recovery_helper"
2023-11-23T19:19:01.436697+08:00 host-192-168-34-165 ctdb-recovery[1322606]: Set recovery mode to ACTIVE
2023-11-23T19:19:01.436760+08:00 host-192-168-34-165 ctdbd[6607]: Recovery has started
2023-11-23T19:19:01.491485+08:00 host-192-168-34-165 ctdb-recovery[1322606]: start_recovery event finished
2023-11-23T19:19:01.491851+08:00 host-192-168-34-165 ctdb-recovery[1322606]: updated VNNMAP
2023-11-23T19:19:01.491888+08:00 host-192-168-34-165 ctdb-recovery[1322606]: recover database 0x6645c6c4
2023-11-23T19:19:01.492199+08:00 host-192-168-34-165 ctdbd[6607]: Freeze db: ctdb.tdb
2023-11-23T19:19:01.506697+08:00 host-192-168-34-165 ctdbd[6607]: Thaw db: ctdb.tdb generation 996306493
2023-11-23T19:19:01.506719+08:00 host-192-168-34-165 ctdbd[6607]: Release freeze handle for db ctdb.tdb
2023-11-23T19:19:01.506736+08:00 host-192-168-34-165 ctdbd[6607]: Resent calls for database=ctdb.tdb, generation=996306493, count=0
2023-11-23T19:19:01.506985+08:00 host-192-168-34-165 ctdb-recovery[1322606]: 1 of 1 databases recovered
2023-11-23T19:19:01.507034+08:00 host-192-168-34-165 ctdbd[6607]: Recovery mode set to NORMAL
2023-11-23T19:19:01.615271+08:00 host-192-168-34-165 ctdb-recovery[1322606]: Set recovery mode to NORMAL
2023-11-23T19:19:01.615383+08:00 host-192-168-34-165 ctdbd[6607]: Recovery has finished
2023-11-23T19:19:01.668818+08:00 host-192-168-34-165 ctdb-recovery[1322606]: recovered event finished
2023-11-23T19:19:01.668895+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Takeover run starting
2023-11-23T19:19:01.668963+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Set takeover_helper to "/usr/libexec/ctdb/ctdb_takeover_helper"
2023-11-23T19:19:01.854276+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Takeover run completed successfully
2023-11-23T19:19:01.854317+08:00 host-192-168-34-165 ctdb-recoverd[7237]: ../../ctdb/server/ctdb_recoverd.c:1406 Recovery complete
2023-11-23T19:19:01.854325+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Resetting ban count to 0 for all nodes
2023-11-23T19:19:01.854331+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Just finished a recovery. New recoveries will now be suppressed for the rerecovery timeout (10 seconds)
2023-11-23T19:19:01.854341+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Disabling recoveries for 10 seconds
2023-11-23T19:19:11.855338+08:00 host-192-168-34-165 ctdb-recoverd[7237]: Reenabling recoveries after timeout

host-192-168-34-166:

2023-11-23T19:18:54.090703+08:00 host-192-168-34-166 ctdbd[6611]: dead count reached for node 0
2023-11-23T19:18:54.090747+08:00 host-192-168-34-166 ctdbd[6611]: Tearing down connection to dead node :0
2023-11-23T19:18:54.090789+08:00 host-192-168-34-166 ctdbd[6611]: 100.100.34.166:4379: node 100.100.34.164:4379 is dead: 1 connected
2023-11-23T19:18:54.090935+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090962+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ctdb_control error: 'node is disconnected'
2023-11-23T19:18:54.090970+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Async operation failed with ret=-1 res=-1 opcode=80
2023-11-23T19:18:54.090975+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Async wait failed - fail_count=1
2023-11-23T19:18:54.090980+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-11-23T19:18:54.090985+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-11-23T19:18:54.090989+08:00 host-192-168-34-166 ctdb-recoverd[7179]: ../../ctdb/server/ctdb_recoverd.c:2640 Unable to update node capabilities.
2023-11-23T19:18:56.312040+08:00 host-192-168-34-166 ctdbd[6611]: Recovery mode set to ACTIVE
2023-11-23T19:18:56.312591+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.admin meta_pool ctdb_reclock 4 4096)
2023-11-23T19:18:58.314499+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Unable to take cluster lock - timeout
2023-11-23T19:19:01.436922+08:00 host-192-168-34-166 ctdbd[6611]: Recovery has started
2023-11-23T19:19:01.492286+08:00 host-192-168-34-166 ctdbd[6611]: Freeze db: ctdb.tdb
2023-11-23T19:19:01.506836+08:00 host-192-168-34-166 ctdbd[6611]: Thaw db: ctdb.tdb generation 996306493
2023-11-23T19:19:01.506859+08:00 host-192-168-34-166 ctdbd[6611]: Release freeze handle for db ctdb.tdb
2023-11-23T19:19:01.506876+08:00 host-192-168-34-166 ctdbd[6611]: Resent calls for database=ctdb.tdb, generation=996306493, count=0
2023-11-23T19:19:01.507116+08:00 host-192-168-34-166 ctdbd[6611]: Recovery mode set to NORMAL
2023-11-23T19:19:01.615475+08:00 host-192-168-34-166 ctdbd[6611]: Recovery has finished
2023-11-23T19:19:01.669217+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Disabling takeover runs for 60 seconds
2023-11-23T19:19:01.681095+08:00 host-192-168-34-166 ctdbd[6611]: Takeover of IP 101.101.34.170/16 on interface bond0.909
2023-11-23T19:19:01.749681+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: stop ganesha-101.101.34.170 before take ip
2023-11-23T19:19:01.749709+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: stop ganesha-101.101.34.170 before take ip success
2023-11-23T19:19:01.749716+08:00 host-192-168-34-166 ctdb-eventd[6614]: 10.interface: ganesha-101.101.34.170
2023-11-23T19:19:01.788451+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: ********** this is nfs_takeip start **********
2023-11-23T19:19:01.788480+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: container ganesha-101.101.34.170 is exist, restart it
2023-11-23T19:19:01.788485+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: ********** this is nfs_takeip end ************
2023-11-23T19:19:01.839082+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Received leader broadcast, leader=1
2023-11-23T19:19:01.848160+08:00 host-192-168-34-166 ctdb-eventd[6614]: 60.nfs: Reconfiguring service "nfs"...
2023-11-23T19:19:01.854402+08:00 host-192-168-34-166 ctdb-recoverd[7179]: Reenabling takeover runs
2023-11-23T19:19:03.882907+08:00 host-192-168-34-166 ctdbd[6611]: Starting traverse on DB ctdb.tdb (id 324774)
2023-11-23T19:19:03.883833+08:00 host-192-168-34-166 ctdbd[6611]: Ending traverse on DB ctdb.tdb (id 324774), records 0
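[Editor's note: the behaviour argued for above (a leader that cannot renew its cluster lock should step down rather than take over more IPs) can be sketched as a small C program. This is purely illustrative of the suggestion in this mail, not of what ctdb-recoverd actually does; all function names here (confirm_cluster_lock, release_public_ips, run_leader_duties) are hypothetical.]

/* Illustrative sketch only -- NOT ctdb-recoverd code. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for talking to the lock backend and to the IP layer. */
static bool confirm_cluster_lock(void)  { return false; /* pretend renewal fails */ }
static void release_public_ips(void)    { puts("releasing all public IPs"); }
static void run_leader_duties(void)     { puts("leader: takeover runs allowed"); }

int main(void)
{
    const unsigned renew_interval_secs = 2;   /* matches the 2-second lock duration mentioned above */
    bool leader = true;

    while (leader) {
        if (!confirm_cluster_lock()) {
            /* Lost (or cannot verify) the lock: step down instead of
             * continuing to assign public IPs to ourselves. */
            fprintf(stderr, "cluster lock renewal failed, stepping down\n");
            release_public_ips();
            leader = false;
            break;
        }
        run_leader_duties();
        sleep(renew_interval_secs);
    }
    return 0;
}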