tu.qiuping
2023-Jun-15 07:48 UTC
[Samba] CTDB: test network oscillations on the leader node, resulting in brain splitting
Thank you so much for taking the time to reply to my email. I'm sorry that I missed the log of node 2. The log of node 2 after 2023-06-12T19:39:40.086620 is as follows:

2023-06-12T19:39:40.086536+08:00 host-192-168-40-133 ctdb-recovery[2603566]: recovered event finished
2023-06-12T19:39:40.086620+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Takeover run starting
2023-06-12T19:39:40.100386+08:00 host-192-168-40-133 ctdbd[2520003]: Takeover of IP 101.101.40.133/16 on interface bond1.906
2023-06-12T19:39:40.101038+08:00 host-192-168-40-133 ctdbd[2520003]: Takeover of IP 101.101.40.132/16 on interface bond1.906
2023-06-12T19:39:40.101633+08:00 host-192-168-40-133 ctdbd[2520003]: Takeover of IP 101.101.40.131/16 on interface bond1.906
2023-06-12T19:39:40.424385+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Takeover run completed successfully
2023-06-12T19:39:40.424452+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1404 Recovery complete
2023-06-12T19:39:40.424467+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Resetting ban count to 0 for all nodes
2023-06-12T19:39:40.424500+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Just finished a recovery. New recoveries will now be suppressed for the rerecovery timeout (10 seconds)
2023-06-12T19:39:40.424516+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Disabling recoveries for 10 seconds
2023-06-12T19:39:40.687836+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Takeover run starting
2023-06-12T19:39:40.774038+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Takeover run completed successfully
2023-06-12T19:39:41.420552+08:00 host-192-168-40-133 ctdbd[2520003]: 192.136.40.133:4379: connected to 192.136.40.131:4379 - 1 connected
2023-06-12T19:39:41.424307+08:00 host-192-168-40-133 ctdbd[2520003]: Node 0 has changed flags - 0x2 -> 0x0
2023-06-12T19:39:42.013192+08:00 host-192-168-40-133 ctdbd[2520003]: 192.136.40.133:4379: connected to 192.136.40.132:4379 - 2 connected
2023-06-12T19:39:42.690010+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Remote node 1 had flags 0x0, local had 0x2 - updating local
2023-06-12T19:39:42.690138+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Pushing updated flags for node 1 (0x0)
2023-06-12T19:39:42.690201+08:00 host-192-168-40-133 ctdbd[2520003]: Node 1 has changed flags - 0x2 -> 0x0
2023-06-12T19:39:42.690282+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Pushing updated flags for node 2 (0x0)
2023-06-12T19:39:42.866980+08:00 host-192-168-40-133 ctdbd[2520003]: Lock contention during renew: -16
2023-06-12T19:39:42.888760+08:00 host-192-168-40-133 ctdbd[2520003]: /usr/libexec/ctdb/ctdb_mutex_clove_rados_helper: Failed to drop lock on RADOS object 'ctdb_reclock' - (No such file or directory)
2023-06-12T19:39:42.894979+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Cluster lock helper terminated
2023-06-12T19:39:42.895168+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Start election
2023-06-12T19:39:42.895267+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to ACTIVE
2023-06-12T19:39:42.895637+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:39:43.119817+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Unable to take cluster lock - contention
2023-06-12T19:39:47.897800+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Leader broadcast timeout
2023-06-12T19:39:47.898145+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Start election
2023-06-12T19:39:50.354774+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 1
2023-06-12T19:39:50.355104+08:00 host-192-168-40-133 ctdbd[2520003]: Tearing down connection to dead node :1
2023-06-12T19:39:50.355153+08:00 host-192-168-40-133 ctdbd[2520003]: 192.136.40.133:4379: node 192.136.40.132:4379 is dead: 1 connected
2023-06-12T19:39:50.355322+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:50.355380+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:50.355438+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async operation failed with ret=-1 res=-1 opcode=16
2023-06-12T19:39:50.355489+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:50.355539+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:50.355555+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async operation failed with ret=-1 res=-1 opcode=80
2023-06-12T19:39:50.424654+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Reenabling recoveries after timeout
2023-06-12T19:39:52.355600+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 0
2023-06-12T19:39:52.355946+08:00 host-192-168-40-133 ctdbd[2520003]: Tearing down connection to dead node :0
2023-06-12T19:39:52.356002+08:00 host-192-168-40-133 ctdbd[2520003]: 192.136.40.133:4379: node 192.136.40.131:4379 is dead: 0 connected
2023-06-12T19:39:52.356236+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:52.356282+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:52.356302+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async operation failed with ret=-1 res=-1 opcode=16
2023-06-12T19:39:52.356321+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async wait failed - fail_count=2
2023-06-12T19:39:52.356339+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:504 Unable to set recovery mode. Recovery failed.
2023-06-12T19:39:52.356358+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1921 Unable to set recovery mode to active on cluster
2023-06-12T19:39:52.356374+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:52.356392+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ctdb_control error: 'node is disconnected'
2023-06-12T19:39:52.356408+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async operation failed with ret=-1 res=-1 opcode=80
2023-06-12T19:39:52.356431+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Async wait failed - fail_count=2
2023-06-12T19:39:52.356443+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_client.c:1558 Failed to read node capabilities.
2023-06-12T19:39:52.356455+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:456 Failed to get node capabilities
2023-06-12T19:39:52.356467+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:2636 Unable to update node capabilities.
2023-06-12T19:39:57.357507+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Leader broadcast timeout
2023-06-12T19:39:57.357811+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Start election
2023-06-12T19:39:57.358129+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:39:57.580652+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Unable to take cluster lock - contention
2023-06-12T19:39:58.361021+08:00 host-192-168-40-133 ctdbd[2520003]: 192.136.40.133:4379: connected to 192.136.40.132:4379 - 1 connected
......
2023-06-12T19:41:21.460624+08:00 host-192-168-40-133 ctdbd[2520003]: 192.136.40.133:4379: connected to 192.136.40.131:4379 - 2 connected
2023-06-12T19:41:28.459502+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Leader broadcast timeout
2023-06-12T19:41:28.459854+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Start election
2023-06-12T19:41:28.460222+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:41:28.685596+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Unable to take cluster lock - contention
2023-06-12T19:41:38.462504+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Leader broadcast timeout
2023-06-12T19:41:38.462856+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Start election
2023-06-12T19:41:38.463151+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:41:38.699469+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Unable to take cluster lock - contention
2023-06-12T19:41:42.896272+08:00 host-192-168-40-133 ctdbd[2520003]: ../../ctdb/server/ctdb_recover.c:609 Been in recovery mode for too long. Dropping all IPS
2023-06-12T19:41:42.899299+08:00 host-192-168-40-133 ctdbd[2520003]: ../../ctdb/server/ctdb_takeover.c:1689 Released 0 public IPs
2023-06-12T19:41:43.464497+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Leader broadcast timeout
2023-06-12T19:41:43.464592+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Start election
2023-06-12T19:41:43.464971+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Attempting to take cluster lock (!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4)
2023-06-12T19:41:43.688413+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Unable to take cluster lock - contention

As you can see from the log, node 2 did not receive the broadcast from node 0.

At the same time, my CTDB cluster is set up as follows:
MonitorInterval=5
RecoveryBanPeriod=5
KeepaliveInterval=1
KeepaliveLimit=2
ElectionTimeout=2

And the renew time of the cluster lock is 4s, and the default values are used for other settings.

And, I use the following command to test node 2 for network oscillation:

==> for i in {1..1000};do ifdown bond1.906&&sleep 5s&&ifup bond1.906&&sleep 5s;done

It should be noted that interface bond1.906 carries both the private addresses and the public addresses of CTDB.

Can you help me again to see what is causing this?
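(Roughly, this is how the configuration looks, sketched assuming the standard /etc/ctdb/ctdb.tunables and /etc/ctdb/ctdb.conf layout of CTDB 4.17; see ctdb-tunables(7) and ctdb.conf(5) for the exact option names, and note that reading the helper's trailing "4" as the lock duration in seconds is an assumption:)

    # /etc/ctdb/ctdb.tunables, one TUNABLE=VALUE per line
    MonitorInterval=5
    RecoveryBanPeriod=5
    KeepaliveInterval=1
    KeepaliveLimit=2
    ElectionTimeout=2

    # /etc/ctdb/ctdb.conf, cluster lock via the RADOS mutex helper;
    # the trailing 4 is assumed to be the lock duration/renew interval in seconds
    [cluster]
        recovery lock = !/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4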
thanks
tuqp

------------------ Original ------------------
From: Martin Schwenke <martin at meltin.net>
Date: Thu, Jun 15, 2023 9:22 AM
To: tu.qiuping <tu.qiuping at qq.com>
Cc: samba <samba at lists.samba.org>
Subject: Re: [Samba] CTDB: test network oscillations on the leader node, resulting in brain splitting

Hi,

On Wed, 14 Jun 2023 09:15:58 +0800, "tu.qiuping via samba" wrote:

> My ctdb version is 4.17.7
> My ctdb cluster configuration is correct and the cluster is healthy before operation.
> My cluster has three nodes, namely 192.168.40.131 (node 0), 192.168.40.132 (node 1), and 192.168.40.133 (node 2). And the node 192.168.40.133 is the leader.
> I conducted network oscillation testing on node 192.168.40.133, and after a period of time, the lock update of this node failed, and at this time, the lock was taken away by node 0. Amazingly, after node 0 received the lock, it sent a message with leader=0 to node 1, but did not send it to node 2, even though the network of node 2 was healthy at this time. And when I restored the network of node 2, node 1 and node 2 kept trying to acquire the lock and reported an error: Unable to take cluster lock - contention.

It would be excellent if you could describe your network oscillation testing in some more detail. I assume you are taking down a network connection for some time and then bringing it back up. The down and up timings would be useful for debugging.

By the way, you should never be amazed by clustering. It is all based on timeout-based heuristics, so you will almost certainly see unexpected things. :-)

I strongly suggest that you merge the logs by sorting them when trying to debug these types of issues.

The main events are...

Nodes 0 and 1 are worried about their leader:

2023-06-12T19:39:32.603600+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Leader broadcast timeout
2023-06-12T19:39:32.604185+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Leader broadcast timeout

The nodes notice that node 2 has become disconnected from the rest of the cluster:

2023-06-12T19:39:34.005643+08:00 host-192-168-40-132 ctdbd[659192]: dead count reached for node 2
2023-06-12T19:39:34.343927+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 0
2023-06-12T19:39:34.344332+08:00 host-192-168-40-133 ctdbd[2520003]: dead count reached for node 1
2023-06-12T19:39:35.414699+08:00 host-192-168-40-131 ctdbd[2853172]: dead count reached for node 2

During this time, node 2 completes recovery with itself:

2023-06-12T19:39:34.361318+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to ACTIVE
...
2023-06-12T19:39:34.445242+08:00 host-192-168-40-133 ctdbd[2520003]: Recovery mode set to NORMAL

Node 2 starts to have problems because it can now take the cluster lock during recovery:

2023-06-12T19:39:37.680876+08:00 host-192-168-40-133 ctdbd[2520003]: ERROR: Daemon able to take recovery lock on "!/usr/libexec/ctdb/ctdb_mutex_clove_rados_helper clove client.admin fs_pool ctdb_reclock 4" during recovery

Node 0 successfully takes the cluster lock:

2023-06-12T19:39:39.229085+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: Took cluster lock, leader=0

Node 0 runs a recovery for nodes 0 and 1:

2023-06-12T19:39:39.420005+08:00 host-192-168-40-131 ctdb-recoverd[2853595]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery
...
2023-06-12T19:39:39.432822+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery has started
2023-06-12T19:39:39.432969+08:00 host-192-168-40-132 ctdbd[659192]: Recovery has started
...
2023-06-12T19:39:39.549895+08:00 host-192-168-40-131 ctdb-recovery[3065583]: 1 of 1 databases recovered
2023-06-12T19:39:39.549985+08:00 host-192-168-40-131 ctdbd[2853172]: Recovery mode set to NORMAL
2023-06-12T19:39:39.551083+08:00 host-192-168-40-132 ctdbd[659192]: Recovery mode set to NORMAL

Node 2 still tries to recover because it thinks it is leader:

2023-06-12T19:39:39.684162+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: ../../ctdb/server/ctdb_recoverd.c:1312 Starting do_recovery

Node 1 receives a leader broadcast (note that node 2 is still disconnected):

2023-06-12T19:39:39.685181+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=0

Nodes 0 & 1 notice the return of node 2:

2023-06-12T19:39:41.420521+08:00 host-192-168-40-131 ctdbd[2853172]: 192.136.40.131:4379: connected to 192.136.40.133:4379 - 2 connected
2023-06-12T19:39:42.013100+08:00 host-192-168-40-132 ctdbd[659192]: 192.136.40.132:4379: connected to 192.136.40.133:4379 - 2 connected

Node 2 thinks it is still leader, so sends a leader broadcast, received by node 1:

2023-06-12T19:39:42.617032+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=2

Note that there aren't enough logs from node 2 to know if it also received a leader broadcast from node 0.

Now node 2 is gone again:

2023-06-12T19:39:51.420623+08:00 host-192-168-40-131 ctdbd[2853172]: dead count reached for node 2

This is only 10s after it rejoined, so KeepaliveInterval and/or KeepaliveLimit must be greatly reduced from the defaults. It is hard to understand what is happening without knowing which configuration values have been changed from the defaults.

I can see 1 main problem here: you have an unresolved split brain. Node 2 does not realise it has lost the cluster lock. It still thinks it is leader while node 0 is able to take the lock. It seems like you need to reduce timeouts on the cluster lock. For the standard fcntl helper, we added a configurable re-check time.

The general situation could be helped by introducing a stronger notion of cluster membership (where membership is granted by the leader). At the moment nodes join the cluster when they are connected and this is too weak. Fixing this requires a major redesign, of the type we have spoken about at SambaXP for years but have not yet had time to do.

By the way, putting "CTDB" in the subject is good because I don't read anything on the general samba list that doesn't have CTDB in the subject. Also, if you post a merge request for CTDB, please add me as a reviewer so I see it. I don't get time to check all merge requests.

Thanks...

peace & happiness,
martin
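(As a side note on the log-merging suggestion above: a minimal sketch of combining the per-node syslog extracts into a single timeline. The file names are placeholders; the ISO 8601 timestamps at the start of each line sort correctly as plain text because all nodes log in the same timezone:)

    # merge per-node logs into one timestamp-ordered file
    # node0.log, node1.log, node2.log are placeholder names for each node's extract
    sort node0.log node1.log node2.log > merged.log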
Martin Schwenke
2023-Jun-18 04:47 UTC
[Samba] CTDB: test network oscillations on the leader node, resulting in brain splitting
Hi tuqp,

On Thu, 15 Jun 2023 15:48:52 +0800, "tu.qiuping" <tu.qiuping at qq.com> wrote:

> I'm sorry that I missed the log of node 2. The log of node 2 after
> 2023-06-12T19:39:40.086620 is as follows:

No problem. Thanks for posting it. However, it doesn't seem to tell us anything new...

> As you can see from the log, node 2 did not receive the broadcast from node 0.

I think the main problem is here:

> 2023-06-12T19:39:42.617032+08:00 host-192-168-40-132 ctdb-recoverd[659348]: Received leader broadcast, leader=2

where node 1 believes node 2 is the leader even though node 0 holds the cluster lock and is the leader. This can only happen if the cluster lock is not configured in a reliable way. More on this below.

> At the same time, my CTDB cluster is set up as follows:
> MonitorInterval=5
> RecoveryBanPeriod=5
> KeepaliveInterval=1
> KeepaliveLimit=2
> ElectionTimeout=2

These settings are much more ambitious than I have ever seen tested with CTDB. You appear to be trying to achieve very fast failover: MonitorInterval, KeepaliveInterval and KeepaliveLimit seem to be set to very low values. What workload requires such fast failover?

In "Designing Data-Intensive Applications", Martin Kleppmann says:

  If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There is unfortunately no simple answer. A long timeout means a long wait until a node is declared dead (and during this time, users may have to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due to a load spike on the node or the network). Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of performing some action (for example, sending an email), and another node takes over, the action may end up being performed twice. We will discuss this issue in more detail in "Knowledge, Truth, and Lies", and in Chapters 9 and 11.

If you haven't read this book, I encourage you to do so, especially Part II: Distributed Data. If nothing else, then at least the section on Unreliable Networks. :-)

In CTDB, banning is meant to exclude a misbehaving node from the cluster. If you set RecoveryBanPeriod to only 5s then a node that is misbehaving due to overload is unlikely to have time to return to a useful state.

> And the renew time of the cluster lock is 4s, and the default values
> are used for other settings.

Do you see a problem on every failover, or just sometimes?

The default setting for "leader timeout" is 5. With the above Keepalive* settings, I am surprised to see a leader broadcast timeout logged before nodes notice they are disconnected. Is "leader timeout" definitely set to the default value of 5?

I definitely think that the cluster lock needs to be checked more often than every 5s. Apparently you are renewing it every 4s. A leader broadcast timeout occurred at 19:39:32.604, suggesting that node 2 is disconnected. Node 0 can take the cluster lock at 19:39:39.229, so it seems odd that node 2 still thinks it is leader at 19:39:42.617 when node 1 receives a leader broadcast from it. Perhaps Ceph is failing over too slowly?
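(As an aside, a sketch of where the two knobs mentioned above could be set. This assumes the ctdb.conf(5) option names in CTDB 4.17, and that the standard fcntl helper accepts an optional re-check interval as its second argument; the lock file path is a placeholder, so please verify against ctdb.conf(5) and the helper's usage message rather than copying this verbatim:)

    # /etc/ctdb/ctdb.conf (option names assumed, check ctdb.conf(5))
    [cluster]
        # time recoverd waits for leader broadcasts before starting an election (default 5s)
        leader timeout = 5
        # standard fcntl helper; the second argument is assumed to be a re-check
        # interval in seconds, so a lost lock is noticed sooner
        # (/clusterfs/.ctdb_reclock is a placeholder path on shared storage)
        recovery lock = !/usr/libexec/ctdb/ctdb_mutex_fcntl_helper /clusterfs/.ctdb_reclock 2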
Note that the lock is only released by node 2 here:

2023-06-12T19:39:42.866980+08:00 host-192-168-40-133 ctdbd[2520003]: Lock contention during renew: -16
2023-06-12T19:39:42.888760+08:00 host-192-168-40-133 ctdbd[2520003]: /usr/libexec/ctdb/ctdb_mutex_clove_rados_helper: Failed to drop lock on RADOS object 'ctdb_reclock' - (No such file or directory)
2023-06-12T19:39:42.894979+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Cluster lock helper terminated
2023-06-12T19:39:42.895168+08:00 host-192-168-40-133 ctdb-recoverd[2520071]: Start election

For things to work properly, it needs to be released faster than this.

> And, I use the following command to test node 2 for network oscillation:
> ==> for i in {1..1000};do ifdown bond1.906&&sleep 5s&&ifup bond1.906&&sleep 5s;done

I think continuously testing network up and down every 5s is probably too ambitious. If I had a network that continuously went up and down that frequently then I would treat this as the normal state and attempt to configure the cluster to take longer to fail over! Well, honestly, I would not use that network.

> It should be noted that interface bond1.906 carries both the private addresses and the public addresses of CTDB.

While this isn't recommended, I can't see it causing a problem with your test. However, it might be worth trying to do this test by controlling switch ports instead.

I'm not sure there is much more I can say...

peace & happiness,
martin