tu.qiuping
2023-Jun-12 16:46 UTC
[Samba] Creating network oscillation on the leader node results in split-brain
Hello, everyone. My ctdb version is 4.17.7.

My ctdb cluster configuration is correct and the cluster was healthy before the test. The cluster has three nodes: 192.168.40.131 (node 0), 192.168.40.132 (node 1), and 192.168.40.133 (node 2), and node 192.168.40.133 was the leader.

I ran network oscillation testing against node 192.168.40.133. After a period of time, the cluster lock update on that node failed, and the lock was taken over by node 0. Surprisingly, after node 0 acquired the lock, it sent a message with leader=0 to node 1, but did not send it to node 2. A short time later, node 0 and node 1 received a broadcast with leader=2, and at that point node 0 did not release the lock yet believed it was no longer the leader, even though the network of node 2 was healthy at that time. When I then restored the network of node 2, node 1 and node 2 kept trying to acquire the lock and reported the error: Unable to take cluster lock - contention.

The logs of the three nodes are attached. Rough sketches of the kind of cluster-lock configuration involved and of how the network flapping can be reproduced are at the end of this message, after the status output.

ctdb status on node 0:

[root@host-192-168-40-131 ~]# ctdb status
Number of nodes:3
pnn:0 192.168.40.131 OK (THIS NODE)
pnn:1 192.168.40.132 OK
pnn:2 192.168.40.133 UNHEALTHY
Generation:629720908
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Leader:UNKNOWN

ctdb status on node 1:

[root@host-192-168-40-132 tecs]# ctdb status
Number of nodes:3
pnn:0 192.168.40.131 OK
pnn:1 192.168.40.132 OK (THIS NODE)
pnn:2 192.168.40.133 UNHEALTHY
Generation:629720908
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:RECOVERY (1)
Leader:UNKNOWN

ctdb status on node 2:

[root@host-192-168-40-133 tecs]# ctdb status
Number of nodes:3
pnn:0 192.168.40.131 UNHEALTHY
pnn:1 192.168.40.132 UNHEALTHY
pnn:2 192.168.40.133 OK (THIS NODE)
Generation:1185443889
Size:1
hash:0 lmaster:2
Recovery mode:RECOVERY (1)
Leader:UNKNOWN
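For reference, the leader in this setup is whichever node holds the cluster (recovery) lock, so the relevant configuration is just the lock setting and the nodes list. Below is only a minimal sketch of that kind of configuration, assuming the lock lives on a shared filesystem mounted at /shared (the path is a placeholder, not my exact setting):

# /etc/ctdb/ctdb.conf (relevant section only)
[cluster]
    # every node must point at the same file on shared storage
    recovery lock = /shared/.ctdb/reclock

# /etc/ctdb/nodes (identical on all three nodes, one private address per line)
192.168.40.131
192.168.40.132
192.168.40.133

Since the lock decides leadership, the behaviour above amounts to node 0 holding the lock while still accepting a leader=2 broadcast, which is exactly the split-brain I am worried about.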
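For anyone trying to reproduce the oscillation, a simple flapping loop on the leader node produces the same up/down pattern. This is only a sketch, assuming iptables is available and that dropping traffic from the two peer addresses is enough to take the node's cluster network down (if the cluster lock lives on networked storage, that path may need to be disturbed as well); it is not my exact test procedure:

# run on node 2 (192.168.40.133): flap the interconnect every 10 seconds
while true; do
    # "network down": drop everything coming from the other cluster nodes
    iptables -I INPUT -s 192.168.40.131 -j DROP
    iptables -I INPUT -s 192.168.40.132 -j DROP
    sleep 10
    # "network up": remove the rules again
    iptables -D INPUT -s 192.168.40.131 -j DROP
    iptables -D INPUT -s 192.168.40.132 -j DROP
    sleep 10
done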