Hello,
I've got a two-node CTDB/Samba cluster and I'm having trouble adding a node
back after having to do an OS reload on it. The servers are running CTDB
2.5.1 and Samba 4.1.4 on AIX 7.1 TL2. The CTDB databases and the Samba
service work fine on the node that was not reloaded, but the rebuilt node is
failing to re-add itself to the cluster. I'm looking for pointers on what to
look for in the ctdb log, or what increased log levels I should set to find
the cause.
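In case it's relevant, this is roughly how I was planning to raise the log
level on the rebuilt node; I'm going from memory of the ctdb man page, so the
exact level names accepted by 2.5.1 may differ:

# check the current level, then raise it on the running daemon
> ctdb getdebug
> ctdb setdebug INFO

# or start ctdbd with a higher level from the beginning (plus whatever
# options the init script normally passes)
> ctdbd --debug=INFO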
On the working node I can start CTDB and have it go into OK status:
> ctdb status
Number of nodes:2
pnn:0 192.168.20.1 OK (THIS NODE)
pnn:1 192.168.20.2 DISCONNECTED|UNHEALTHY|INACTIVE
Generation:349517425
Size:1
hash:0 lmaster:0
Recovery mode:NORMAL (0)
Recovery master:0
The ctdb.log has:
2014/02/26 11:58:55.637024 [12452056]: Recovery lock file set to "".
Disabling recovery lock checking
2014/02/26 11:58:55.648102 [6553670]: Starting CTDBD (Version 2.5.1) as
PID: 6553670
2014/02/26 11:58:55.768838 [6553670]: Vacuuming is disabled for persistent
database account_policy.tdb
2014/02/26 11:58:55.817376 [6553670]: Vacuuming is disabled for persistent
database group_mapping.tdb
2014/02/26 11:58:55.892066 [6553670]: Vacuuming is disabled for persistent
database passdb.tdb
2014/02/26 11:58:55.939510 [6553670]: Vacuuming is disabled for persistent
database registry.tdb
2014/02/26 11:58:55.986739 [6553670]: Vacuuming is disabled for persistent
database secrets.tdb
2014/02/26 11:58:56.034569 [6553670]: Vacuuming is disabled for persistent
database share_info.tdb
2014/02/26 11:58:56.034715 [6553670]: Freeze priority 1
2014/02/26 11:58:56.040888 [6553670]: Freeze priority 2
2014/02/26 11:58:56.043987 [6553670]: Freeze priority 3
2014/02/26 11:58:56.437970 [6553670]: Freeze priority 1
2014/02/26 11:58:56.438152 [6553670]: Freeze priority 2
2014/02/26 11:58:56.438292 [6553670]: Freeze priority 3
2014/02/26 11:58:59.440570 [recoverd:12320780]: Trigger takeoverrun
2014/02/26 11:58:59.442438 [6553670]: Freeze priority 1
2014/02/26 11:58:59.442598 [6553670]: Freeze priority 2
2014/02/26 11:58:59.442742 [6553670]: Freeze priority 3
2014/02/26 11:59:00.186508 [6553670]: Thawing priority 1
2014/02/26 11:59:00.186610 [6553670]: Release freeze handler for prio 1
2014/02/26 11:59:00.186691 [6553670]: Thawing priority 2
2014/02/26 11:59:00.186737 [6553670]: Release freeze handler for prio 2
2014/02/26 11:59:00.186793 [6553670]: Thawing priority 3
2014/02/26 11:59:00.186838 [6553670]: Release freeze handler for prio 3
2014/02/26 11:59:00.554819 [recoverd:12320780]: Resetting ban count to 0
for all nodes
2014/02/26 11:59:16.834495 [6553670]: Node became HEALTHY. Ask recovery
master 0 to perform ip reallocation
When I start ctdbd on the rebuilt node I see the following at the
beginning of the ctdb log:
2014/02/26 12:10:28.018036 [11141228]: Recovery lock file set to "".
Disabling recovery lock checking
2014/02/26 12:10:28.030003 [10027232]: Starting CTDBD (Version 2.5.1) as
PID: 10027232
2014/02/26 12:10:28.152649 [10027232]: Vacuuming is disabled for
persistent database account_policy.tdb
2014/02/26 12:10:28.204170 [10027232]: Vacuuming is disabled for
persistent database group_mapping.tdb
2014/02/26 12:10:28.255586 [10027232]: Vacuuming is disabled for
persistent database passdb.tdb
2014/02/26 12:10:28.307094 [10027232]: Vacuuming is disabled for
persistent database registry.tdb
2014/02/26 12:10:28.358609 [10027232]: Vacuuming is disabled for
persistent database secrets.tdb
2014/02/26 12:10:28.410227 [10027232]: Vacuuming is disabled for
persistent database share_info.tdb
2014/02/26 12:10:28.410334 [10027232]: Freeze priority 1
2014/02/26 12:10:28.414451 [10027232]: Freeze priority 2
2014/02/26 12:10:28.417383 [10027232]: Freeze priority 3
2014/02/26 12:10:30.173294 [10027232]: Freeze priority 1
2014/02/26 12:10:30.173525 [10027232]: Freeze priority 2
2014/02/26 12:10:30.173710 [10027232]: Freeze priority 3
2014/02/26 12:10:33.500359 [10027232]: Freeze priority 1
2014/02/26 12:10:33.639712 [10027232]: Freeze priority 2
2014/02/26 12:10:33.677081 [recoverd:11272374]:
server/ctdb_recoverd.c:3699 Current recmaster node 0 does not have
CAP_RECMASTER, but we (node 1) have - force an election
2014/02/26 12:10:33.677186 [10027232]: Freeze priority 1
2014/02/26 12:10:33.677336 [10027232]: Freeze priority 2
2014/02/26 12:10:33.677491 [10027232]: Freeze priority 3
2014/02/26 12:10:33.789645 [10027232]: Freeze priority 3
2014/02/26 12:10:35.132212 [10027232]: common/ctdb_ltdb.c:211 Failed to
store dynamic data
2014/02/26 12:10:35.301494 [10027232]:
/usr/smb_cluster/var/lib/ctdb/persistent/passdb.tdb.1:tdb_wipe_all: failed
to write hash 0
2014/02/26 12:10:35.301555 [10027232]: server/ctdb_freeze.c:470 Failed to
wipe database for db 'passdb.tdb'
2014/02/26 12:10:37.310573 [10027232]: Freeze priority 1
2014/02/26 12:10:37.311101 [10027232]: Freeze priority 2
2014/02/26 12:10:37.311576 [10027232]: Freeze priority 3
2014/02/26 12:10:37.452567 [10027232]: Failed to open event script
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:10:37.452633 [10027232]: server/ctdb_recover.c:1006 Failed
to start recovery
2014/02/26 12:10:37.452703 [10027232]: pnn 1 Invalid reqid 0 in
ctdb_reply_control
2014/02/26 12:10:40.177172 [10027232]: read error alloc failed for 64
2014/02/26 12:11:37.459277 [10027232]: Freeze priority 1
2014/02/26 12:11:37.459742 [10027232]: Freeze priority 2
2014/02/26 12:11:37.460226 [10027232]: Freeze priority 3
2014/02/26 12:11:37.583899 [10027232]: Failed to open event script
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:11:37.583952 [10027232]: server/ctdb_recover.c:1006 Failed
to start recovery
2014/02/26 12:11:37.584015 [10027232]: pnn 1 Invalid reqid 0 in
ctdb_reply_control
2014/02/26 12:12:37.592541 [10027232]: Freeze priority 1
2014/02/26 12:12:37.593051 [10027232]: Freeze priority 2
2014/02/26 12:12:37.593522 [10027232]: Freeze priority 3
2014/02/26 12:12:37.728160 [10027232]: Failed to open event script
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:12:37.728215 [10027232]: server/ctdb_recover.c:1006 Failed
to start recovery
2014/02/26 12:12:37.728280 [10027232]: pnn 1 Invalid reqid 0 in
ctdb_reply_control
2014/02/26 12:13:37.735576 [10027232]: Freeze priority 1
2014/02/26 12:13:37.736043 [10027232]: Freeze priority 2
2014/02/26 12:13:37.736500 [10027232]: Freeze priority 3
2014/02/26 12:13:37.856533 [10027232]: Failed to open event script
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:13:37.856639 [10027232]: server/ctdb_recover.c:1006 Failed
to start recovery
2014/02/26 12:13:37.856705 [10027232]: pnn 1 Invalid reqid 0 in
ctdb_reply_control
2014/02/26 12:14:37.858397 [10027232]: Banning this node for 300 seconds
2014/02/26 12:14:38.033523 [10027232]: server/ctdb_monitor.c:485 Node 0
became healthy - force recovery for startup
2014/02/26 12:14:38.581971 [recoverd:11272374]: Refusing to disable
takeover runs on inactive node
2014/02/26 12:14:38.702472 [10027232]: Failed to open event script
directory '/usr/smb_cluster/etc/ctdb/events.d'
2014/02/26 12:14:38.702524 [10027232]: Failed to run "ipreallocated"
event
2014/02/26 12:15:37.737758 [10027232]: server/ctdb_recover.c:562 Been in
recovery mode for too long. Dropping all IPS
Back on the first node I see the following added:
2014/02/26 12:10:33.364973 [6553670]: Freeze priority 1
2014/02/26 12:10:33.515860 [6553670]: Freeze priority 2
2014/02/26 12:10:33.657076 [6553670]: Freeze priority 3
2014/02/26 12:10:35.301189 [recoverd:12320780]: Async operation failed
with ret=0 res=-1 opcode=67
2014/02/26 12:10:35.301275 [recoverd:12320780]: Async wait failed -
fail_count=1
2014/02/26 12:10:35.301327 [recoverd:12320780]:
server/ctdb_recoverd.c:1468 Unable to wipe database. Recovery failed.
2014/02/26 12:10:35.301968 [recoverd:12320780]:
server/ctdb_recoverd.c:1915 Failed to recover database 0x3ef19640
I'm not sure which of these messages is important to look at. The event
script directory does exist and contains only 00.ctdb. I'm not using CTDB to
manage IP addresses, as I have an IP load balancer (similar to LVS) in front
of the servers. Should I be focusing on the error handling the ipreallocated
event, or on the error regarding passdb.tdb earlier in the log? I have had a
few clusters with this setup for a few years, but this is the first cluster
at CTDB 2.5.1 and Samba 4.1.x. I don't think I've ever had to rebuild a node
before, so I can't say whether my older clusters have this issue or not.
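For reference, these are the sorts of checks I can run on the rebuilt node;
the paths are just my install prefix from the log above, and the tdbtool
check is a guess at how to verify the persistent database copy:

# confirm the event script directory and 00.ctdb are readable by the
# user running ctdbd
> ls -ld /usr/smb_cluster/etc/ctdb/events.d
> ls -l /usr/smb_cluster/etc/ctdb/events.d/00.ctdb

# see what ctdbd itself reports for the event scripts
> ctdb scriptstatus

# check the local persistent copy of passdb that failed to wipe
> tdbtool /usr/smb_cluster/var/lib/ctdb/persistent/passdb.tdb.1 check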
John Janosik
jpjanosi at us.ibm.com