Hi,
We are using CTDB version 1.0.77, and yesterday we saw an instance of a node
running into issues and banning itself to recover (as listed below):
node1:
2009/07/29 23:23:37.748251 [22371]: Banning node 0 for 300 seconds
2009/07/29 23:23:37.748263 [22371]: self ban - lowering our election priority
2009/07/29 23:23:37.748503 [22275]: This node has been banned - forcing freeze and recovery
Now the other nodes in the CTDB cluster receive the ban message, and even
though the banned PNN does not match their own, they ban themselves and
go into recovery mode as well. I guess this is not supposed to happen?
node2 (should not ban itself):
2009/07/29 23:23:37.748659 [19905]: Got a ban request for pnn:0 but our pnn is 1. Ignoring ban request
2009/07/29 23:23:37.748994 [19776]: This node has been banned - forcing freeze and recovery
node3 (should not ban itself):
2009/07/29 23:23:37.748506 [19892]: Got a ban request for pnn:0 but our pnn is 2. Ignoring ban request
2009/07/29 23:23:37.749575 [19750]: This node has been banned - forcing freeze and recovery
Existing Version 1.0.77: ctdb-1.0.77/ctdb_monitor.c
241    if ((node->flags & NODE_FLAGS_BANNED) && !(c->old_flags & NODE_FLAGS_BANNED)) {
242        /* make sure we are frozen */
243        DEBUG(DEBUG_NOTICE,("This node has been banned - forcing freeze and recovery\n"));
--
I see a condition added in the "ban algorithm" in the latest 1.0.88 to
ensure the banned node's PNN matches this node's own PNN
((node->pnn == ctdb->pnn)):
--
Version 1.0.88:
311    /* if we have become banned, we should go into recovery mode */
312    if ((node->flags & NODE_FLAGS_BANNED) && !(c->old_flags & NODE_FLAGS_BANNED) && (node->pnn == ctdb->pnn)) {
313        /* make sure we are frozen */
314        DEBUG(DEBUG_NOTICE,("This node has been banned - forcing freeze and recovery\n"));
Can you please confirm whether upgrading to 1.0.88 would fix this issue, so
that one node getting banned no longer causes the other nodes to ban
themselves unnecessarily?
Thanks,
-Tim