tu.qiuping
2023-Jun-14 01:15 UTC
[Samba] CTDB: test network oscillations on the leader node, resulting in brain splitting
My ctdb version is 4.17.7.

Hello, everyone.

My ctdb cluster configuration is correct and the cluster was healthy before the test. My cluster has three nodes: 192.168.40.131 (node 0), 192.168.40.132 (node 1) and 192.168.40.133 (node 2), and node 192.168.40.133 was the leader.

I conducted network oscillation testing on node 192.168.40.133. After a period of time, the lock renewal on this node failed, and at that point the lock was taken over by node 0. Amazingly, after node 0 took the lock, it sent a message with leader=0 to node 1, but did not send it to node 2, even though the network of node 2 was healthy at that time. And when I restored the network of node 2, node 1 and node 2 kept trying to acquire the lock and reported an error: Unable to take cluster lock - contention.

Here are the logs of the three nodes:

node 0:

2023-06-12T19:38:24.810747+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Reenabling takeover runs
2023-06-12T19:39:32.603600+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Leader broadcast timeout
2023-06-12T19:39:32.603791+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Start election
2023-06-12T19:39:32.603951+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery mode set to ACTIVE
2023-06-12T19:39:35.414699+08:00 host-192-168-40-131 ctdbd[2853172]: dead count reached for node 2
2023-06-12T19:39:35.414878+08:00 host-192-168-40-131 ctdbd[2853172]: Tearing down connection to dead node :2
2023-06-12T19:39:35.414975+08:00 host-192-168-40-131 ctdbd[2853172]: 192.136.40.131:4379: node 192.136.40.133:4379 is dead: 1 connected
2023-06-12T19:39:35.415107+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:35.415153+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:35.415177+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Async operation failed with ret=-1 res=-1 opcode=16
2023-06-12T19:39:35.415199+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Async wait failed - fail_count=1
2023-06-12T19:39:35.415215+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:504 Unable to set recovery mode. Recovery failed.
2023-06-12T19:39:35.415239+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1921 Unable to set recovery mode to active on cluster
2023-06-12T19:39:35.415257+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:35.415271+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:35.415285+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Async operation failed with ret=-1 res=-1 opcode=80
2023-06-12T19:39:35.415296+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Async wait failed - fail_count=1
2023-06-12T19:39:35.415310+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-06-12T19:39:35.415325+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-06-12T19:39:35.415341+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:2636 Unable to update node capabilities.
2023-06-12T19:39:39.008818+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:39:39.229010+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Cluster lock taken successfully
2023-06-12T19:39:39.229085+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Took cluster lock, leader=0
2023-06-12T19:39:39.419851+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Node:0 was in recovery mode. Start recovery process
2023-06-12T19:39:39.419986+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Node:1 was in recovery mode. Start recovery process
2023-06-12T19:39:39.420005+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery
2023-06-12T19:39:39.420020+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1356 Recovery initiated due to problem with node 0
2023-06-12T19:39:39.420479+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1386 Recovery - updated flags
2023-06-12T19:39:39.420531+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Set recovery_helper to "/usr/libexec/ctdb/ctdb_recovery_helper"
2023-06-12T19:39:39.432698+08:00 host-192-168-40-131 ctdb-recovery[3065583]: Set recovery mode to ACTIVE
2023-06-12T19:39:39.432822+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery has started
2023-06-12T19:39:39.433998+08:00 host-192-168-40-131 ctdbd[2853172]: Monitoring event was cancelled
2023-06-12T19:39:39.523430+08:00 host-192-168-40-131 ctdb-recovery[3065583]: start_recovery event finished
2023-06-12T19:39:39.523783+08:00 host-192-168-40-131 ctdb-recovery[3065583]: updated VNNMAP
2023-06-12T19:39:39.523826+08:00 host-192-168-40-131 ctdb-recovery[3065583]: recover database 0x6645c6c4
2023-06-12T19:39:39.524090+08:00 host-192-168-40-131 ctdbd[2853172]: Freeze db: ctdb.tdb
2023-06-12T19:39:39.549471+08:00 host-192-168-40-131 ctdbd[2853172]: Thaw db: ctdb.tdb generation 629720908
2023-06-12T19:39:39.549571+08:00 host-192-168-40-131 ctdbd[2853172]: Release freeze handle for db ctdb.tdb
2023-06-12T19:39:39.549613+08:00 host-192-168-40-131 ctdbd[2853172]: Resent calls for database=ctdb.tdb, generation=629720908, count=0
2023-06-12T19:39:39.549895+08:00 host-192-168-40-131 ctdb-recovery[3065583]: 1 of 1 databases recovered
2023-06-12T19:39:39.549985+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery mode set to NORMAL
2023-06-12T19:39:39.779180+08:00 host-192-168-40-131 ctdb-recovery[3065583]: Set recovery mode to NORMAL
2023-06-12T19:39:39.779404+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery has finished
2023-06-12T19:39:39.871666+08:00 host-192-168-40-131 ctdb-recovery[3065583]: recovered event finished
2023-06-12T19:39:39.871811+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Takeover run starting
2023-06-12T19:39:39.871914+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Set takeover_helper to "/usr/libexec/ctdb/ctdb_takeover_helper"
2023-06-12T19:39:39.887613+08:00 host-192-168-40-131 ctdbd[2853172]: Takeover of IP 101.101.40.131/16 on interface bond1.906
2023-06-12T19:39:40.091901+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Takeover run completed successfully
2023-06-12T19:39:40.092002+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1404 Recovery complete
2023-06-12T19:39:40.092022+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Resetting ban count to 0 for all nodes
2023-06-12T19:39:40.092038+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Just finished a recovery. New recoveries will now be suppressed for the rerecovery timeout (10 seconds)
2023-06-12T19:39:40.092057+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Disabling recoveries for 10 seconds
2023-06-12T19:39:41.420521+08:00 host-192-168-40-131 ctdbd[2853172]: 192.136.40.131:4379: connected to 192.136.40.133:4379 - 2 connected
2023-06-12T19:39:41.424132+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Pushing updated flags for node 0 (0x0)
2023-06-12T19:39:41.424411+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Pushing updated flags for node 1 (0x0)
2023-06-12T19:39:41.424595+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Remote node 2 had flags 0x0, local had 0x2 - updating local
2023-06-12T19:39:41.424624+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Pushing updated flags for node 2 (0x0)
2023-06-12T19:39:41.424697+08:00 host-192-168-40-131 ctdbd[2853172]: Node 2 has changed flags - 0x2 -> 0x0
2023-06-12T19:39:41.615784+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Received leader broadcast, leader=2
2023-06-12T19:39:42.895351+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery mode set to ACTIVE
2023-06-12T19:39:48.425262+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Leader broadcast timeout
2023-06-12T19:39:48.425472+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Start election
2023-06-12T19:39:50.092932+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Reenabling recoveries after timeout
2023-06-12T19:39:51.420623+08:00 host-192-168-40-131 ctdbd[2853172]: dead count reached for node 2
2023-06-12T19:39:51.420823+08:00 host-192-168-40-131 ctdbd[2853172]: Tearing down connection to dead node :2
2023-06-12T19:39:51.420902+08:00 host-192-168-40-131 ctdbd[2853172]: 192.136.40.131:4379: node 192.136.40.133:4379 is dead: 1 connected
2023-06-12T19:39:51.421146+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:51.421208+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:51.421229+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Async operation failed with ret=-1 res=-1 opcode=16
2023-06-12T19:39:51.421250+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Async wait failed - fail_count=1
2023-06-12T19:39:51.421270+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:504 Unable to set recovery mode. Recovery failed.
2023-06-12T19:39:51.421288+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1921 Unable to set recovery mode to active on cluster
2023-06-12T19:39:51.421306+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:51.421340+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:51.421359+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Async operation failed with ret=-1 res=-1 opcode=80
2023-06-12T19:39:51.421374+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Async wait failed - fail_count=1
2023-06-12T19:39:51.421387+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-06-12T19:39:51.421403+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-06-12T19:39:51.421418+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:2636 Unable to update node capabilities.
2023-06-12T19:39:56.422492+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Leader broadcast timeout
2023-06-12T19:39:56.422854+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Start election
2023-06-12T19:39:58.426518+08:00 host-192-168-40-131 ctdbd[2853172]: 192.136.40.131:4379: connected to 192.136.40.133:4379 - 2 connected
......
2023-06-12T19:41:42.896165+08:00 host-192-168-40-131 ctdbd[2853172]: ../../ctdb/server/ctdb_recover.c:609 Been in recovery mode for too long. Dropping all IPS
2023-06-12T19:41:43.129050+08:00 host-192-168-40-131 ctdbd[2853172]: ../../ctdb/server/ctdb_takeover.c:1689 Released 2 public IPs

node 1:

2023-06-12T19:38:21.816071+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Reenabling takeover runs
2023-06-12T19:38:23.567879+08:00 host-192-168-40-132 ctdbd[659192]: Node 0 has changed flags - 0x2 -> 0x0
2023-06-12T19:38:24.498391+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Disabling takeover runs for 60 seconds
2023-06-12T19:38:24.810767+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Reenabling takeover runs
2023-06-12T19:39:32.604151+08:00 host-192-168-40-132 ctdbd[659192]: Recovery mode set to ACTIVE
2023-06-12T19:39:32.604185+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout
2023-06-12T19:39:32.604497+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Start election
2023-06-12T19:39:34.005643+08:00 host-192-168-40-132 ctdbd[659192]: dead count reached for node 2
2023-06-12T19:39:34.005944+08:00 host-192-168-40-132 ctdbd[659192]: Tearing down connection to dead node :2
2023-06-12T19:39:34.006030+08:00 host-192-168-40-132 ctdbd[659192]: 192.136.40.132:4379: node 192.136.40.133:4379 is dead: 1 connected
2023-06-12T19:39:34.006220+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:34.006270+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:34.006292+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Async operation failed with ret=-1 res=-1 opcode=16
2023-06-12T19:39:34.006312+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Async wait failed - fail_count=1
2023-06-12T19:39:34.006327+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_recoverd.c:504 Unable to set recovery mode. Recovery failed.
2023-06-12T19:39:34.006343+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_recoverd.c:1921 Unable to set recovery mode to active on cluster
2023-06-12T19:39:34.006355+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:34.006368+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:34.006388+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Async operation failed with ret=-1 res=-1 opcode=80
2023-06-12T19:39:34.006404+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Async wait failed - fail_count=1
2023-06-12T19:39:34.006420+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-06-12T19:39:34.006445+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-06-12T19:39:34.006460+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_recoverd.c:2636 Unable to update node capabilities.
2023-06-12T19:39:39.007830+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout
2023-06-12T19:39:39.008166+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Start election
2023-06-12T19:39:39.008670+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:39:39.239919+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Unable to take cluster lock - contention
2023-06-12T19:39:39.432969+08:00 host-192-168-40-132 ctdbd[659192]: Recovery has started
2023-06-12T19:39:39.434278+08:00 host-192-168-40-132 ctdbd[659192]: Monitoring event was cancelled
2023-06-12T19:39:39.524137+08:00 host-192-168-40-132 ctdbd[659192]: Freeze db: ctdb.tdb
2023-06-12T19:39:39.549504+08:00 host-192-168-40-132 ctdbd[659192]: Thaw db: ctdb.tdb generation 629720908
2023-06-12T19:39:39.549541+08:00 host-192-168-40-132 ctdbd[659192]: Release freeze handle for db ctdb.tdb
2023-06-12T19:39:39.549572+08:00 host-192-168-40-132 ctdbd[659192]: Resent calls for database=ctdb.tdb, generation=629720908, count=0
2023-06-12T19:39:39.551083+08:00 host-192-168-40-132 ctdbd[659192]: Recovery mode set to NORMAL
2023-06-12T19:39:39.685181+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=0
2023-06-12T19:39:39.779475+08:00 host-192-168-40-132 ctdbd[659192]: Recovery has finished
2023-06-12T19:39:39.872177+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Disabling takeover runs for 60 seconds
2023-06-12T19:39:40.092083+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Reenabling takeover runs
2023-06-12T19:39:42.013100+08:00 host-192-168-40-132 ctdbd[659192]: 192.136.40.132:4379: connected to 192.136.40.133:4379 - 2 connected
2023-06-12T19:39:42.617032+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=2
2023-06-12T19:39:42.690377+08:00 host-192-168-40-132 ctdbd[659192]: Node 2 has changed flags - 0x2 -> 0x0
2023-06-12T19:39:42.895359+08:00 host-192-168-40-132 ctdbd[659192]: Recovery mode set to ACTIVE
2023-06-12T19:39:42.895786+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:39:43.128864+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Unable to take cluster lock - contention
2023-06-12T19:39:48.243826+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout
2023-06-12T19:39:48.244196+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Start election
2023-06-12T19:39:52.012650+08:00 host-192-168-40-132 ctdbd[659192]: dead count reached for node 2
2023-06-12T19:39:52.013026+08:00 host-192-168-40-132 ctdbd[659192]: Tearing down connection to dead node :2
2023-06-12T19:39:52.013115+08:00 host-192-168-40-132 ctdbd[659192]: 192.136.40.132:4379: node 192.136.40.133:4379 is dead: 1 connected
2023-06-12T19:39:52.013315+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:52.013357+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:52.013377+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Async operation failed with ret=-1 res=-1 opcode=16
2023-06-12T19:39:52.013395+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Async wait failed - fail_count=1
2023-06-12T19:39:52.013412+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_recoverd.c:504 Unable to set recovery mode. Recovery failed.
2023-06-12T19:39:52.013426+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_recoverd.c:1921 Unable to set recovery mode to active on cluster
2023-06-12T19:39:52.013440+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:52.013451+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:52.013469+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Async operation failed with ret=-1 res=-1 opcode=80
2023-06-12T19:39:52.013484+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Async wait failed - fail_count=1
2023-06-12T19:39:52.013496+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-06-12T19:39:52.013510+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-06-12T19:39:52.013548+08:00 host-192-168-40-132 ctdb-recoverd[659348]: ../../ctdb/server/ctdb_recoverd.c:2636 Unable to update node capabilities.
2023-06-12T19:39:58.361119+08:00 host-192-168-40-132 ctdbd[659192]: 192.136.40.132:4379: connected to 192.136.40.133:4379 - 2 connected
......
2023-06-12T19:41:21.395073+08:00 host-192-168-40-132 ctdbd[659192]: 192.136.40.132:4379: connected to 192.136.40.133:4379 - 2 connected
2023-06-12T19:41:23.457821+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout
2023-06-12T19:41:23.457993+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Start election
2023-06-12T19:41:23.458442+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:41:23.687664+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Unable to take cluster lock - contention
2023-06-12T19:41:33.460741+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout
2023-06-12T19:41:33.460987+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Start election
2023-06-12T19:41:33.461322+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:41:33.691522+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Unable to take cluster lock - contention
2023-06-12T19:41:38.463097+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout
2023-06-12T19:41:38.463319+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Start election
2023-06-12T19:41:38.463547+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:41:38.690525+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Unable to take cluster lock - contention
2023-06-12T19:41:42.896175+08:00 host-192-168-40-132 ctdbd[659192]: ../../ctdb/server/ctdb_recover.c:609 Been in recovery mode for too long. Dropping all IPS
2023-06-12T19:41:43.015057+08:00 host-192-168-40-132 ctdbd[659192]: ../../ctdb/server/ctdb_takeover.c:1689 Released 1 public IPs
2023-06-12T19:41:48.465489+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout
2023-06-12T19:41:48.465680+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Start election
2023-06-12T19:41:48.466097+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:41:48.699376+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Unable to take cluster lock - contention
2023-06-12T19:41:53.467823+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout

node 2:

2023-06-12T19:38:24.810606+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Takeover run completed successfully
2023-06-12T19:39:34.343927+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 0
2023-06-12T19:39:34.344224+08:00 host-192-168-40-133 ctdbd[2520003]: Tearing down connection to dead node :0
2023-06-12T19:39:34.344286+08:00 host-192-168-40-133 ctdbd[2520003]: 192.136.40.133:4379: node 192.136.40.131:4379 is dead: 1 connected
2023-06-12T19:39:34.344332+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 1
2023-06-12T19:39:34.344346+08:00 host-192-168-40-133 ctdbd[2520003]: Tearing down connection to dead node :1
2023-06-12T19:39:34.344363+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:34.344385+08:00 host-192-168-40-133 ctdbd[2520003]: 192.136.40.133:4379: node 192.136.40.132:4379 is dead: 0 connected
2023-06-12T19:39:34.344402+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:34.344421+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async operation failed with ret=-1 res=-1 opcode=80
2023-06-12T19:39:34.344444+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:34.344462+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:34.344498+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async operation failed with ret=-1 res=-1 opcode=80
2023-06-12T19:39:34.344512+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async wait failed - fail_count=2
2023-06-12T19:39:34.344528+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-06-12T19:39:34.344540+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-06-12T19:39:34.344558+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:2636 Unable to update node capabilities.
2023-06-12T19:39:34.347698+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Assigned IP 101.101.40.131 not on an interface
2023-06-12T19:39:34.347741+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Trigger takeoverrun
2023-06-12T19:39:34.347831+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:2778 The vnnmap count is different from the number of active lmaster nodes: 3 vs 1
2023-06-12T19:39:34.347883+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery
2023-06-12T19:39:34.347902+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1356 Recovery initiated due to problem with node 2
2023-06-12T19:39:34.347968+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1386 Recovery - updated flags
2023-06-12T19:39:34.361318+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to ACTIVE
2023-06-12T19:39:34.361391+08:00 host-192-168-40-133 ctdb-recovery[2602609]: Set recovery mode to ACTIVE
2023-06-12T19:39:34.361534+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery has started
2023-06-12T19:39:34.432177+08:00 host-192-168-40-133 ctdb-recovery[2602609]: start_recovery event finished
2023-06-12T19:39:34.432300+08:00 host-192-168-40-133 ctdb-recovery[2602609]: updated VNNMAP
2023-06-12T19:39:34.432332+08:00 host-192-168-40-133 ctdb-recovery[2602609]: recover database 0x6645c6c4
2023-06-12T19:39:34.432552+08:00 host-192-168-40-133 ctdbd[2520003]: Freeze db: ctdb.tdb
2023-06-12T19:39:34.444853+08:00 host-192-168-40-133 ctdbd[2520003]: Thaw db: ctdb.tdb generation 2038499985
2023-06-12T19:39:34.444893+08:00 host-192-168-40-133 ctdbd[2520003]: Release freeze handle for db ctdb.tdb
2023-06-12T19:39:34.444929+08:00 host-192-168-40-133 ctdbd[2520003]: Resent calls for database=ctdb.tdb, generation=2038499985, count=0
2023-06-12T19:39:34.445165+08:00 host-192-168-40-133 ctdb-recovery[2602609]: 1 of 1 databases recovered
2023-06-12T19:39:34.445242+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to NORMAL
2023-06-12T19:39:37.680876+08:00 host-192-168-40-133 ctdbd[2520003]: ERROR: Daemon able to take recovery lock on "!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4" during recovery
2023-06-12T19:39:37.681222+08:00 host-192-168-40-133 ctdb-recovery[2602609]: failed to set recovery mode NORMAL on node 2, ret=-1
2023-06-12T19:39:37.681295+08:00 host-192-168-40-133 ctdb-recovery[2602609]: database recovery failed, ret=-1
2023-06-12T19:39:37.683075+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery
2023-06-12T19:39:37.683118+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1356 Recovery initiated due to problem with node 2
2023-06-12T19:39:37.683184+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1386 Recovery - updated flags
2023-06-12T19:39:37.696168+08:00 host-192-168-40-133 ctdb-recovery[2603390]: Set recovery mode to ACTIVE
2023-06-12T19:39:37.696303+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery has started
2023-06-12T19:39:37.765714+08:00 host-192-168-40-133 ctdb-recovery[2603390]: start_recovery event finished
2023-06-12T19:39:37.765836+08:00 host-192-168-40-133 ctdb-recovery[2603390]: updated VNNMAP
2023-06-12T19:39:37.765868+08:00 host-192-168-40-133 ctdb-recovery[2603390]: recover database 0x6645c6c4
2023-06-12T19:39:37.766049+08:00 host-192-168-40-133 ctdbd[2520003]: Freeze db: ctdb.tdb
2023-06-12T19:39:37.782767+08:00 host-192-168-40-133 ctdbd[2520003]: Thaw db: ctdb.tdb generation 17680608
2023-06-12T19:39:37.782805+08:00 host-192-168-40-133 ctdbd[2520003]: Release freeze handle for db ctdb.tdb
2023-06-12T19:39:37.782839+08:00 host-192-168-40-133 ctdbd[2520003]: Resent calls for database=ctdb.tdb, generation=17680608, count=0
2023-06-12T19:39:37.783066+08:00 host-192-168-40-133 ctdb-recovery[2603390]: 1 of 1 databases recovered
2023-06-12T19:39:37.783141+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to NORMAL
2023-06-12T19:39:38.012885+08:00 host-192-168-40-133 ctdbd[2520003]: ERROR: Daemon able to take recovery lock on "!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4" during recovery
2023-06-12T19:39:38.013088+08:00 host-192-168-40-133 ctdb-recovery[2603390]: failed to set recovery mode NORMAL on node 2, ret=-1
2023-06-12T19:39:38.013160+08:00 host-192-168-40-133 ctdb-recovery[2603390]: database recovery failed, ret=-1
2023-06-12T19:39:38.683213+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery
2023-06-12T19:39:38.683268+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1356 Recovery initiated due to problem with node 2
2023-06-12T19:39:38.683331+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1386 Recovery - updated flags
2023-06-12T19:39:38.696114+08:00 host-192-168-40-133 ctdb-recovery[2603477]: Set recovery mode to ACTIVE
2023-06-12T19:39:38.696251+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery has started
2023-06-12T19:39:38.766848+08:00 host-192-168-40-133 ctdb-recovery[2603477]: start_recovery event finished
2023-06-12T19:39:38.766949+08:00 host-192-168-40-133 ctdb-recovery[2603477]: updated VNNMAP
2023-06-12T19:39:38.766977+08:00 host-192-168-40-133 ctdb-recovery[2603477]: recover database 0x6645c6c4
2023-06-12T19:39:38.767114+08:00 host-192-168-40-133 ctdbd[2520003]: Freeze db: ctdb.tdb
2023-06-12T19:39:38.783705+08:00 host-192-168-40-133 ctdbd[2520003]: Thaw db: ctdb.tdb generation 1195788082
2023-06-12T19:39:38.783746+08:00 host-192-168-40-133 ctdbd[2520003]: Release freeze handle for db ctdb.tdb
2023-06-12T19:39:38.783782+08:00 host-192-168-40-133 ctdbd[2520003]: Resent calls for database=ctdb.tdb, generation=1195788082, count=0
2023-06-12T19:39:38.784015+08:00 host-192-168-40-133 ctdb-recovery[2603477]: 1 of 1 databases recovered
2023-06-12T19:39:38.784084+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to NORMAL
2023-06-12T19:39:39.012847+08:00 host-192-168-40-133 ctdbd[2520003]: ERROR: Daemon able to take recovery lock on "!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4" during recovery
2023-06-12T19:39:39.012980+08:00 host-192-168-40-133 ctdb-recovery[2603477]: failed to set recovery mode NORMAL on node 2, ret=-1
2023-06-12T19:39:39.013035+08:00 host-192-168-40-133 ctdb-recovery[2603477]: database recovery failed, ret=-1
2023-06-12T19:39:39.684162+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery
2023-06-12T19:39:39.684386+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1356 Recovery initiated due to problem with node 2
2023-06-12T19:39:39.684451+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1386 Recovery - updated flags
2023-06-12T19:39:39.697442+08:00 host-192-168-40-133 ctdb-recovery[2603566]: Set recovery mode to ACTIVE
2023-06-12T19:39:39.697579+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery has started
2023-06-12T19:39:39.768395+08:00 host-192-168-40-133 ctdb-recovery[2603566]: start_recovery event finished
2023-06-12T19:39:39.768527+08:00 host-192-168-40-133 ctdb-recovery[2603566]: updated VNNMAP
2023-06-12T19:39:39.768563+08:00 host-192-168-40-133 ctdb-recovery[2603566]: recover database 0x6645c6c4
2023-06-12T19:39:39.768781+08:00 host-192-168-40-133 ctdbd[2520003]: Freeze db: ctdb.tdb
2023-06-12T19:39:39.785087+08:00 host-192-168-40-133 ctdbd[2520003]: Thaw db: ctdb.tdb generation 1185443889
2023-06-12T19:39:39.785126+08:00 host-192-168-40-133 ctdbd[2520003]: Release freeze handle for db ctdb.tdb
2023-06-12T19:39:39.785158+08:00 host-192-168-40-133 ctdbd[2520003]: Resent calls for database=ctdb.tdb, generation=1185443889, count=0
2023-06-12T19:39:39.785483+08:00 host-192-168-40-133 ctdb-recovery[2603566]: 1 of 1 databases recovered
2023-06-12T19:39:39.785557+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to NORMAL
2023-06-12T19:39:40.016882+08:00 host-192-168-40-133 ctdb-recovery[2603566]: Set recovery mode to NORMAL
2023-06-12T19:39:40.016973+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery has finished
2023-06-12T19:39:40.086536+08:00 host-192-168-40-133 ctdb-recovery[2603566]: recovered event finished
2023-06-12T19:39:40.086620+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Takeover run starting

Sent from my iPhone
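For reference, the cluster lock helper seen in the log lines above is configured via the "recovery lock" option in ctdb.conf. A minimal sketch of what that entry presumably looks like, reconstructed from the helper invocation in the logs (ctdb_mutex_clove_rados_helper appears to be a site-specific variant of the upstream ctdb_mutex_ceph_rados_helper, whose arguments are Ceph cluster name, cephx user, RADOS pool, object name and lock timeout in seconds):

  # /etc/ctdb/ctdb.conf (reconstructed, not taken from the poster's system)
  [cluster]
      # The leading "!" tells CTDB to run a mutex helper program
      # instead of using fcntl locking on a file directly.
      recovery lock = !/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4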
Rowland Penny
2023-Jun-14 06:24 UTC
[Samba] CTDB: test network oscillations on the leader node, resulting in brain splitting
On 14/06/2023 02:15, tu.qiuping via samba wrote:

> My ctdb version is 4.17.7
>
> Hello, everyone.
> My ctdb cluster configuration is correct and the cluster is healthy before operation.
> My cluster has three nodes, namely 192.168.40.131 (node 0), 192.168.40.132 (node 1), and 192.168.40.133 (node 2). And the node 192.168.40.133 is the leader.
>
> I conducted network oscillation testing on node 192.168.40.133, and after a period of time, the lock update of this node failed, and at this time, the lock was taken away by node 0.
> Amazingly, after node 0 received the lock, it sent a message with leader=0 to node 1, but did not send it to node 2, even though the network of node 2 was healthy at this time.
> And when I restored the network of node 2, node 1 and node 2 kept trying to acquire the lock and reported an error: Unable to take cluster lock - contention.

Can you please stop posting the same post to the list. If someone can help you and has time to do so, they will reply. Posting exactly the same post three times will not get you a faster response.

Rowland
Martin Schwenke
2023-Jun-15 01:22 UTC
[Samba] CTDB: test network oscillations on the leader node, resulting in brain splitting
Hi,

On Wed, 14 Jun 2023 09:15:58 +0800, "tu.qiuping via samba" <samba at lists.samba.org> wrote:

> My ctdb version is 4.17.7
> My ctdb cluster configuration is correct and the cluster is healthy before operation.
> My cluster has three nodes, namely 192.168.40.131 (node 0), 192.168.40.132 (node 1), and 192.168.40.133 (node 2). And the node 192.168.40.133 is the leader.
> I conducted network oscillation testing on node 192.168.40.133, and after a period of time, the lock update of this node failed, and at this time, the lock was taken away by node 0. Amazingly, after node 0 received the lock, it sent a message with leader=0 to node 1, but did not send it to node 2, even though the network of node 2 was healthy at this time. And when I restored the network of node 2, node 1 and node 2 kept trying to acquire the lock and reported an error: Unable to take cluster lock - contention.

It would be excellent if you could describe your network oscillation testing in some more detail. I assume you are taking down a network connection for some time and then bringing it back up. The down and up timings would be useful for debugging.

By the way, you should never be amazed by clustering. It is all based on timeout-based heuristics, so you will almost certainly see unexpected things. :-)

I strongly suggest that you merge the logs by sorting them when trying to debug these types of issues (a concrete sketch follows at the end of this post). The main events are...

Nodes 0 and 1 are worried about their leader:

2023-06-12T19:39:32.603600+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Leader broadcast timeout
2023-06-12T19:39:32.604185+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout

The nodes notice that node 2 has become disconnected from the rest of the cluster:

2023-06-12T19:39:34.005643+08:00 host-192-168-40-132 ctdbd[659192]: dead count reached for node 2
2023-06-12T19:39:34.343927+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 0
2023-06-12T19:39:34.344332+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 1
2023-06-12T19:39:35.414699+08:00 host-192-168-40-131 ctdbd[2853172]: dead count reached for node 2

During this time, node 2 completes a recovery with itself:

2023-06-12T19:39:34.361318+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to ACTIVE
...
2023-06-12T19:39:34.445242+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to NORMAL

Node 2 starts to have problems because it can now take the cluster lock during recovery:

2023-06-12T19:39:37.680876+08:00 host-192-168-40-133 ctdbd[2520003]: ERROR: Daemon able to take recovery lock on "!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4" during recovery

Node 0 successfully takes the cluster lock:

2023-06-12T19:39:39.229085+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Took cluster lock, leader=0

Node 0 runs a recovery for nodes 0 and 1:

2023-06-12T19:39:39.420005+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery
...
2023-06-12T19:39:39.432822+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery has started
2023-06-12T19:39:39.432969+08:00 host-192-168-40-132 ctdbd[659192]: Recovery has started
...
2023-06-12T19:39:39.549895+08:00 host-192-168-40-131 ctdb-recovery[3065583]: 1 of 1 databases recovered
2023-06-12T19:39:39.549985+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery mode set to NORMAL
2023-06-12T19:39:39.551083+08:00 host-192-168-40-132 ctdbd[659192]: Recovery mode set to NORMAL

Node 2 still tries to recover because it thinks it is leader:

2023-06-12T19:39:39.684162+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery

Node 1 receives a leader broadcast (note that node 2 is still disconnected):

2023-06-12T19:39:39.685181+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=0

Nodes 0 and 1 notice the return of node 2:

2023-06-12T19:39:41.420521+08:00 host-192-168-40-131 ctdbd[2853172]: 192.136.40.131:4379: connected to 192.136.40.133:4379 - 2 connected
2023-06-12T19:39:42.013100+08:00 host-192-168-40-132 ctdbd[659192]: 192.136.40.132:4379: connected to 192.136.40.133:4379 - 2 connected

Node 2 thinks it is still leader, so it sends a leader broadcast, received here by node 1:

2023-06-12T19:39:42.617032+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=2

Note that there aren't enough logs from node 2 to know if it also received a leader broadcast from node 0.

Now node 2 is gone again:

2023-06-12T19:39:51.420623+08:00 host-192-168-40-131 ctdbd[2853172]: dead count reached for node 2

This is only 10s after it rejoined, so KeepaliveInterval and/or KeepaliveLimit must be greatly reduced from the defaults. It is hard to understand what is happening without knowing which configuration values have been changed from the defaults.

I can see one main problem here: you have an unresolved split brain. Node 2 does not realise it has lost the cluster lock. It still thinks it is leader while node 0 is able to take the lock. It seems like you need to reduce timeouts on the cluster lock. For the standard fcntl helper, we added a configurable re-check time.

The general situation could be helped by introducing a stronger notion of cluster membership (where membership is granted by the leader). At the moment nodes join the cluster when they are connected, and this is too weak. Fixing this requires a major redesign, of the type we have spoken about at SambaXP for years but have not yet had time to do.

By the way, putting "CTDB" in the subject is good, because I don't read anything on the general samba list that doesn't have CTDB in the subject. Also, if you post a merge request for CTDB, please add me as a reviewer so I see it. I don't get time to check all merge requests.

Thanks...

peace & happiness,
martin
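A concrete version of the log-merging suggestion above, assuming the three per-node logs have been saved to separate files (the file names here are illustrative). Because the syslog timestamps are ISO 8601 with a fixed UTC offset, a plain lexical sort on the first field interleaves the entries in time order:

  # Merge the per-node CTDB logs into a single timeline.
  sort -k1,1 node0.log node1.log node2.log > merged.log

  # If each file is already internally sorted (normal for logs),
  # a merge is cheaper than a full sort:
  sort -m -k1,1 node0.log node1.log node2.log > merged.log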
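On the keepalive timing mentioned above: the values in effect can be inspected on a running node with the ctdb tool, and persistent overrides normally live in /etc/ctdb/ctdb.tunables. A sketch, assuming stock defaults (KeepaliveInterval=5 seconds and KeepaliveLimit=5, so a peer is declared dead after roughly 25-30 seconds of silence rather than the ~10 seconds seen in these logs):

  # Show the keepalive tunables on a running node
  ctdb getvar KeepaliveInterval
  ctdb getvar KeepaliveLimit

  # Persistent overrides (loaded at ctdbd startup) go in
  # /etc/ctdb/ctdb.tunables, one KEY=VALUE per line, e.g.:
  #   KeepaliveInterval=5
  #   KeepaliveLimit=5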
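And on the re-check time for the standard fcntl helper: if memory serves, the helper's optional second argument is the re-check interval in seconds, so a ctdb.conf entry along these lines (the lock path is illustrative and must be on cluster-wide shared storage) should make a stale leader notice within that interval that it has lost the lock:

  # /etc/ctdb/ctdb.conf (sketch, assuming the bundled fcntl helper)
  [cluster]
      # second argument: re-check the lock every 5 seconds
      recovery lock = !/usr/libexec/ctdb/ctdb_mutex_fcntl_helper /clusterfs/.ctdb/reclock 5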