zhu.shangzhong at zte.com.cn
2018-Sep-05 08:29 UTC
[Samba] [ctdb] Unable to run startrecovery event (if mail content is encrypted, please see the attached file)
A 3-node CTDB cluster is running. When one of the 3 nodes is powered down, a large number of messages are written to log.ctdb.

node1 repeatedly logs:

2018/09/04 04:35:06.414369 ctdbd[10129]: Recovery has started
2018/09/04 04:35:06.414944 ctdbd[10129]: connect() failed, errno=111
2018/09/04 04:35:06.415076 ctdbd[10129]: Unable to run startrecovery event

node2 repeatedly logs:

2018/09/04 04:35:09.412368 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:09.412689 ctdb-recoverd[9437]: Already holding recovery lock
2018/09/04 04:35:09.412700 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1326 Recovery initiated due to problem with node 1
2018/09/04 04:35:09.412974 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1351 Recovery - created remote databases
2018/09/04 04:35:09.413319 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1380 Recovery - updated flags
2018/09/04 04:35:09.417616 ctdb-recovery[1562]: set recovery mode to ACTIVE
2018/09/04 04:35:09.417765 ctdbd[8488]: Recovery has started
2018/09/04 04:35:09.722686 ctdb-recovery[1562]: failed to run start_recovery event on node 1, ret=-1
2018/09/04 04:35:09.722732 ctdb-recovery[1562]: database recovery failed, ret=-1

What happened?

The node layout is:

node1: IP: 10.231.8.65  VIP: 10.231.8.68
node2: IP: 10.231.8.66  VIP: 10.231.8.69
node3: IP: 10.231.8.67  VIP: 10.231.8.70

Power down node3.

The node1 log is as follows:

2018/09/04 04:29:33.402108 ctdbd[10129]: 10.231.8.65:4379: node 10.231.8.67:4379 is dead: 1 connected
2018/09/04 04:29:33.414817 ctdbd[10129]: Tearing down connection to dead node :0
2018/09/04 04:29:33.420483 ctdbd[10129]: Recovery mode set to ACTIVE
2018/09/04 04:29:33.422946 ctdb-recoverd[11302]: ctdb_control error: 'node is disconnected'
2018/09/04 04:29:33.438802 ctdb-recoverd[11302]: ../ctdb/client/ctdb_client.c:1584 ctdb_control for getnodes failed ret:-1 res:-1
2018/09/04 04:29:33.438814 ctdb-recoverd[11302]: ../ctdb/server/ctdb_recoverd.c:2515 Unable to get nodemap from recovery master 0
2018/09/04 04:29:36.440112 ctdb-recoverd[11302]: Election period ended
2018/09/04 04:29:52.465429 ctdb-recoverd[11302]: Recmaster node 2 is inactive. Force election
2018/09/04 04:29:52.465663 ctdbd[10129]: This node (1) is now the recovery master
2018/09/04 04:29:55.468771 ctdb-recoverd[11302]: Election period ended
2018/09/04 04:29:55.469404 ctdb-recoverd[11302]: Node 2 has changed flags - now 0x8 was 0x0
2018/09/04 04:29:55.469475 ctdb-recoverd[11302]: Remote node 2 had flags 0x8, local had 0x0 - updating local
2018/09/04 04:29:55.469514 ctdb-recoverd[11302]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:29:55.469525 ctdb-recoverd[11302]: Attempting to take recovery lock (/share-fs/export/ctdb/.ctdb/reclock)
2018/09/04 04:29:55.563522 ctdb-recoverd[11302]: Unable to take recovery lock - contention
2018/09/04 04:29:55.563573 ctdb-recoverd[11302]: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
2018/09/04 04:29:55.563585 ctdb-recoverd[11302]: Banning node 1 for 300 seconds
2018/09/04 04:29:55.563638 ctdbd[10129]: Banning this node for 300 seconds
2018/09/04 04:29:55.563674 ctdbd[10129]: This node has been banned - releasing all public IPs and setting the generation to INVALID.
2018/09/04 04:29:55.570212 ctdb-eventd[10131]: Bad talloc magic value - wrong talloc version used/mixed
2018/09/04 04:29:57.240533 ctdbd[10129]: Eventd went away
2018/09/04 04:30:25.593641 ctdb-recoverd[11302]: ../ctdb/client/ctdb_client.c:1065 control timed out. reqid:50956260 opcode:109 dstnode:1
2018/09/04 04:30:25.594181 ctdb-recoverd[11302]: ../ctdb/client/ctdb_client.c:1178 ctdb_control_recv failed
2018/09/04 04:30:25.594199 ctdb-recoverd[11302]: ../ctdb/client/ctdb_client.c:4325 ctdb_control for set ban state failed
2018/09/04 04:30:25.594210 ctdb-recoverd[11302]: ../ctdb/server/ctdb_recoverd.c:292 Failed to ban node 1
2018/09/04 04:30:25.594541 ctdbd[10129]: Freeze all
2018/09/04 04:30:25.594570 ctdbd[10129]: Freeze db: ctdb.tdb
2018/09/04 04:30:55.564600 ctdb-recoverd[11302]: ../ctdb/client/ctdb_client.c:1010 reqid 50956260 not found
2018/09/04 04:31:33.420908 ctdbd[10129]: ../ctdb/server/ctdb_recover.c:821 Been in recovery mode for too long. Dropping all IPS
2018/09/04 04:31:33.421075 ctdbd[10129]: ../ctdb/server/ctdb_takeover.c:1584 Not releasing IP 10.231.8.70/27 on interface bond1.826, an update is already in progess
2018/09/04 04:31:33.421127 ctdbd[10129]: ../ctdb/server/ctdb_takeover.c:1618 Released 0 public IPs
2018/09/04 04:34:52.390122 ctdbd[10129]: Remote node (2) is now the recovery master
2018/09/04 04:34:55.391432 ctdb-recoverd[11302]: Election period ended
2018/09/04 04:34:55.393704 ctdb-recoverd[11302]: Node 1 has changed flags - now 0x8 was 0x0
2018/09/04 04:34:55.420135 ctdb-recoverd[11302]: Node 2 has changed flags - now 0x0 was 0x8
2018/09/04 04:34:55.564161 ctdbd[10129]: Banning timed out
2018/09/04 04:34:56.294067 ctdb-recoverd[11302]: Disabling takeover runs for 60 seconds
2018/09/04 04:34:56.310695 ctdbd[10129]: Release of IP 10.231.8.70/27 rejected update for this IP already in flight
2018/09/04 04:34:56.313720 ctdb-recoverd[11302]: Reenabling takeover runs
2018/09/04 04:34:56.394669 ctdb-recoverd[11302]: Node 1 has changed flags - now 0x0 was 0x8
2018/09/04 04:35:06.414369 ctdbd[10129]: Recovery has started
2018/09/04 04:35:06.414944 ctdbd[10129]: connect() failed, errno=111
2018/09/04 04:35:06.415076 ctdbd[10129]: Unable to run startrecovery event
2018/09/04 04:35:07.415352 ctdbd[10129]: Recovery has started
2018/09/04 04:35:07.415538 ctdbd[10129]: connect() failed, errno=111
2018/09/04 04:35:07.415561 ctdbd[10129]: Unable to run startrecovery event
2018/09/04 04:35:08.416875 ctdbd[10129]: Recovery has started
2018/09/04 04:35:08.417078 ctdbd[10129]: connect() failed, errno=111
2018/09/04 04:35:08.417102 ctdbd[10129]: Unable to run startrecovery event
2018/09/04 04:35:09.417794 ctdbd[10129]: Recovery has started
2018/09/04 04:35:09.417980 ctdbd[10129]: connect() failed, errno=111
2018/09/04 04:35:09.418020 ctdbd[10129]: Unable to run startrecovery event
2018/09/04 04:35:10.417816 ctdbd[10129]: Recovery has started
2018/09/04 04:35:10.418031 ctdbd[10129]: connect() failed, errno=111
2018/09/04 04:35:10.418051 ctdbd[10129]: Unable to run startrecovery event
......
The node2 log is as follows:

2018/09/04 04:29:33.402123 ctdbd[8488]: 10.231.8.66:4379: node 10.231.8.67:4379 is dead: 1 connected
2018/09/04 04:29:33.402186 ctdbd[8488]: Tearing down connection to dead node :0
2018/09/04 04:29:33.420074 ctdb-recoverd[9437]: ctdb_control error: 'node is disconnected'
2018/09/04 04:29:33.420124 ctdb-recoverd[9437]: ../ctdb/client/ctdb_client.c:1584 ctdb_control for getnodes failed ret:-1 res:-1
2018/09/04 04:29:33.420136 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:2515 Unable to get nodemap from recovery master 0
2018/09/04 04:29:33.420433 ctdb-recoverd[9437]: Current recmaster node 0 does not have CAP_RECMASTER, but we (node 2) have - force an election
2018/09/04 04:29:33.420476 ctdbd[8488]: Recovery mode set to ACTIVE
2018/09/04 04:29:33.420599 ctdbd[8488]: This node (2) is now the recovery master
2018/09/04 04:29:36.424544 ctdb-recoverd[9437]: Election period ended
2018/09/04 04:29:36.425227 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:29:36.425245 ctdb-recoverd[9437]: Attempting to take recovery lock (/share-fs/export/ctdb/.ctdb/reclock)
2018/09/04 04:29:51.633079 ctdb-recoverd[9437]: Unable to take recovery lock - contention
2018/09/04 04:29:51.633146 ctdb-recoverd[9437]: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
2018/09/04 04:29:51.633162 ctdb-recoverd[9437]: Banning node 2 for 300 seconds
2018/09/04 04:29:51.633229 ctdbd[8488]: Banning this node for 300 seconds
2018/09/04 04:29:51.633255 ctdbd[8488]: This node has been banned - releasing all public IPs and setting the generation to INVALID.
2018/09/04 04:29:51.998861 ctdb-eventd[8489]: 60.nfs: method return sender=:1.288347 -> dest=:1.479075 reply_serial=2
2018/09/04 04:29:51.998902 ctdb-eventd[8489]: 60.nfs: boolean true
2018/09/04 04:29:51.998914 ctdb-eventd[8489]: 60.nfs: string "Started grace period"
2018/09/04 04:29:52.048126 ctdbd[8488]: ../ctdb/server/ctdb_takeover.c:1618 Released 1 public IPs
2018/09/04 04:29:52.048386 ctdbd[8488]: Freeze all
2018/09/04 04:29:52.048403 ctdbd[8488]: Freeze db: ctdb.tdb
2018/09/04 04:29:52.465899 ctdbd[8488]: Remote node (1) is now the recovery master
2018/09/04 04:29:55.466276 ctdb-recoverd[9437]: Election period ended
2018/09/04 04:29:55.469674 ctdb-recoverd[9437]: Node 2 has changed flags - now 0x8 was 0x0
2018/09/04 04:31:33.420952 ctdbd[8488]: ../ctdb/server/ctdb_recover.c:821 Been in recovery mode for too long. Dropping all IPS
2018/09/04 04:31:33.421619 ctdbd[8488]: ../ctdb/server/ctdb_takeover.c:1618 Released 0 public IPs
2018/09/04 04:34:51.633451 ctdbd[8488]: Banning timed out
2018/09/04 04:34:52.389739 ctdb-recoverd[9437]: Recmaster node 1 is inactive. Force election
2018/09/04 04:34:52.389840 ctdbd[8488]: This node (2) is now the recovery master
2018/09/04 04:34:55.392938 ctdb-recoverd[9437]: Election period ended
2018/09/04 04:34:55.393638 ctdb-recoverd[9437]: Node 1 has changed flags - now 0x8 was 0x0
2018/09/04 04:34:55.393724 ctdb-recoverd[9437]: Remote node 1 had flags 0x8, local had 0x0 - updating local
2018/09/04 04:34:55.393761 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:34:55.393773 ctdb-recoverd[9437]: Attempting to take recovery lock (/share-fs/export/ctdb/.ctdb/reclock)
2018/09/04 04:34:55.419681 ctdb-recoverd[9437]: Recovery lock taken successfully by recovery daemon
2018/09/04 04:34:55.419707 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1326 Recovery initiated due to problem with node 0
2018/09/04 04:34:55.419743 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1351 Recovery - created remote databases
2018/09/04 04:34:55.420183 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1380 Recovery - updated flags
2018/09/04 04:34:55.420296 ctdb-recoverd[9437]: Set recovery_helper to "/usr/libexec/ctdb/ctdb_recovery_helper"
2018/09/04 04:34:55.437949 ctdb-recovery[903]: set recovery mode to ACTIVE
2018/09/04 04:34:55.438059 ctdbd[8488]: Recovery has started
2018/09/04 04:34:55.768894 ctdb-recovery[903]: start_recovery event finished
2018/09/04 04:34:55.768987 ctdb-recovery[903]: updated VNNMAP
2018/09/04 04:34:55.769012 ctdb-recovery[903]: recover database 0x6645c6c4
2018/09/04 04:34:55.769099 ctdbd[8488]: Freeze db: ctdb.tdb frozen
2018/09/04 04:34:55.825711 ctdb-recovery[903]: Pull persistent db ctdb.tdb from node 2 with seqnum 0x0
2018/09/04 04:34:55.826195 ctdb-recovery[903]: Pulled 1 records for db ctdb.tdb from node 2
2018/09/04 04:34:55.904690 ctdb-recovery[903]: Wrote 1 buffers of recovery records for ctdb.tdb
2018/09/04 04:34:55.904779 ctdb-recovery[903]: Pushing buffer 0 with 1 records for ctdb.tdb
2018/09/04 04:34:55.904838 ctdb-recovery[903]: Pushed 1 records for db ctdb.tdb
2018/09/04 04:34:55.985856 ctdbd[8488]: Thaw db: ctdb.tdb generation 952840072
2018/09/04 04:34:55.985883 ctdbd[8488]: Release freeze handle for db ctdb.tdb
2018/09/04 04:34:55.986181 ctdb-recovery[903]: 1 of 1 databases recovered
2018/09/04 04:34:55.986233 ctdbd[8488]: Recovery mode set to NORMAL
2018/09/04 04:34:55.986247 ctdbd[8488]: Thawing all
2018/09/04 04:34:55.986256 ctdbd[8488]: Release freeze handle
2018/09/04 04:34:55.991557 ctdb-recovery[903]: set recovery mode to NORMAL
2018/09/04 04:34:55.991634 ctdbd[8488]: Recovery has finished
2018/09/04 04:34:56.293808 ctdb-recovery[903]: recovered event finished
2018/09/04 04:34:56.293880 ctdb-recoverd[9437]: Takeover run starting
2018/09/04 04:34:56.296633 ctdb-recoverd[9437]: Set takeover_helper to "/usr/libexec/ctdb/ctdb_takeover_helper"
2018/09/04 04:34:56.310786 ctdb-takeover[975]: RELEASE_IP 10.231.8.70 failed on node 1, ret=-1
2018/09/04 04:34:56.310873 ctdb-takeover[975]: Assigning banning credits to node 1
2018/09/04 04:34:56.310902 ctdb-takeover[975]: takeover run failed, ret=-1
2018/09/04 04:34:56.313645 ctdb-recoverd[9437]: ctdb_takeover_run() failed
2018/09/04 04:34:56.313663 ctdb-recoverd[9437]: Takeover run unsuccessful
2018/09/04 04:34:56.313675 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1398 Recovery complete
2018/09/04 04:34:56.313684 ctdb-recoverd[9437]: Resetting ban count to 0 for all nodes
2018/09/04 04:34:56.313694 ctdb-recoverd[9437]: Just finished a recovery. New recoveries will now be supressed for the rerecovery timeout (10 seconds)
2018/09/04 04:34:56.313703 ctdb-recoverd[9437]: Disabling recoveries for 10 seconds
2018/09/04 04:34:56.394112 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:34:56.394234 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:34:56.394538 ctdb-recoverd[9437]: Node 1 has changed flags - now 0x0 was 0x8
2018/09/04 04:34:56.394624 ctdb-recoverd[9437]: Remote node 1 had flags 0x0, local had 0x8 - updating local
2018/09/04 04:34:56.394943 ctdb-recoverd[9437]: Node:1 was in recovery mode. Start recovery process
2018/09/04 04:34:56.394956 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:34:56.394965 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:34:57.395556 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:34:57.395750 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:34:57.396037 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:34:57.396051 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:34:58.396533 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:34:58.396674 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:34:58.397020 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:34:58.397033 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:34:59.398608 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:34:59.398816 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:34:59.399108 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:34:59.399121 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:35:00.399586 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:35:00.399750 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:35:00.400040 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:00.400054 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:35:01.401503 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:35:01.401718 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:35:01.402002 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:01.402018 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:35:02.403450 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:35:02.403592 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:35:02.403939 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:02.403952 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:35:03.404425 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:35:03.404555 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:35:03.404875 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:03.404887 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:35:04.406400 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:35:04.406532 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:35:04.406781 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:04.406794 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:35:05.408287 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:35:05.408466 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:35:05.408709 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:05.408722 ctdb-recoverd[9437]: Unable to begin - recoveries are disabled
2018/09/04 04:35:06.314014 ctdb-recoverd[9437]: Reenabling recoveries after timeout
2018/09/04 04:35:06.408650 ctdb-recoverd[9437]: Unassigned IP 10.231.8.69 can be served by this node
2018/09/04 04:35:06.408823 ctdb-recoverd[9437]: Trigger takeoverrun
2018/09/04 04:35:06.409201 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:06.409216 ctdb-recoverd[9437]: Already holding recovery lock
2018/09/04 04:35:06.409226 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1326 Recovery initiated due to problem with node 1
2018/09/04 04:35:06.409498 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1351 Recovery - created remote databases
2018/09/04 04:35:06.410074 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1380 Recovery - updated flags
2018/09/04 04:35:06.414075 ctdbd[8488]: Recovery mode set to ACTIVE
2018/09/04 04:35:06.414218 ctdb-recovery[1454]: set recovery mode to ACTIVE
2018/09/04 04:35:06.414344 ctdbd[8488]: Recovery has started
2018/09/04 04:35:06.725031 ctdb-recovery[1454]: failed to run start_recovery event on node 1, ret=-1
2018/09/04 04:35:06.725081 ctdb-recovery[1454]: database recovery failed, ret=-1
2018/09/04 04:35:07.410233 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:07.410274 ctdb-recoverd[9437]: Already holding recovery lock
2018/09/04 04:35:07.410286 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1326 Recovery initiated due to problem with node 1
2018/09/04 04:35:07.410546 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1351 Recovery - created remote databases
2018/09/04 04:35:07.411074 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1380 Recovery - updated flags
2018/09/04 04:35:07.415209 ctdb-recovery[1497]: set recovery mode to ACTIVE
2018/09/04 04:35:07.415329 ctdbd[8488]: Recovery has started
2018/09/04 04:35:07.718345 ctdb-recovery[1497]: failed to run start_recovery event on node 1, ret=-1
2018/09/04 04:35:07.718386 ctdb-recovery[1497]: database recovery failed, ret=-1
2018/09/04 04:35:08.411549 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:08.411591 ctdb-recoverd[9437]: Already holding recovery lock
2018/09/04 04:35:08.411601 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1326 Recovery initiated due to problem with node 1
2018/09/04 04:35:08.411870 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1351 Recovery - created remote databases
2018/09/04 04:35:08.412529 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1380 Recovery - updated flags
2018/09/04 04:35:08.416727 ctdb-recovery[1529]: set recovery mode to ACTIVE
2018/09/04 04:35:08.416850 ctdbd[8488]: Recovery has started
2018/09/04 04:35:08.721065 ctdb-recovery[1529]: failed to run start_recovery event on node 1, ret=-1
2018/09/04 04:35:08.721110 ctdb-recovery[1529]: database recovery failed, ret=-1
2018/09/04 04:35:09.412368 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
2018/09/04 04:35:09.412689 ctdb-recoverd[9437]: Already holding recovery lock
2018/09/04 04:35:09.412700 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1326 Recovery initiated due to problem with node 1
2018/09/04 04:35:09.412974 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1351 Recovery - created remote databases
2018/09/04 04:35:09.413319 ctdb-recoverd[9437]: ../ctdb/server/ctdb_recoverd.c:1380 Recovery - updated flags
2018/09/04 04:35:09.417616 ctdb-recovery[1562]: set recovery mode to ACTIVE
2018/09/04 04:35:09.417765 ctdbd[8488]: Recovery has started
2018/09/04 04:35:09.722686 ctdb-recovery[1562]: failed to run start_recovery event on node 1, ret=-1
2018/09/04 04:35:09.722732 ctdb-recovery[1562]: database recovery failed, ret=-1
......
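For context on the repeated "connect() failed, errno=111" lines above: on Linux, errno 111 is ECONNREFUSED, which is what connect() returns when the Unix-domain socket it targets still exists but no longer has a listener, for example because the daemon that owned it has exited. The following is a minimal standalone C sketch, using a placeholder socket path rather than the real CTDB eventd socket, that reproduces the same errno:

/* Bind a Unix-domain socket but never listen on it, then connect to it:
 * connect() fails with ECONNREFUSED (errno 111 on Linux), the same error
 * ctdbd logs once its event daemon has gone away. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    const char *path = "/tmp/fake-eventd.sock";   /* placeholder path */
    struct sockaddr_un addr;
    int srv, cli;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);

    /* "Dead" server: the socket file exists, but nobody is listening. */
    srv = socket(AF_UNIX, SOCK_STREAM, 0);
    if (srv == -1 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        perror("bind");
        return 1;
    }

    cli = socket(AF_UNIX, SOCK_STREAM, 0);
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        /* Prints: connect() failed, errno=111 (Connection refused) */
        printf("connect() failed, errno=%d (%s)\n", errno, strerror(errno));
    }

    close(cli);
    close(srv);
    unlink(path);
    return 0;
}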
Martin Schwenke
2018-Sep-05 10:28 UTC
[Samba] [ctdb] Unable to run startrecovery event (if mail content is encrypted, please see the attached file)
Thanks for reporting this. It looks very interesting and we will fix it all as soon as we understand it! :-)

On Wed, 5 Sep 2018 16:29:31 +0800 (CST), "zhu.shangzhong--- via samba" <samba at lists.samba.org> wrote:

> A 3-node CTDB cluster is running. When one of the 3 nodes is powered
> down, a large number of messages are written to log.ctdb.

Can you please let us know what version of Samba/CTDB you're using?

Note that you're referring to nodes 1, 2, 3 while CTDB numbers the nodes 0, 1, 2. In fact, the situation is a little more confused than this:

> Power down node3.
> The node1 log is as follows:
> 2018/09/04 04:29:33.402108 ctdbd[10129]: 10.231.8.65:4379: node 10.231.8.67:4379 is dead: 1 connected
> 2018/09/04 04:29:33.414817 ctdbd[10129]: Tearing down connection to dead node :0

It appears that the node you're calling node 3 is the one CTDB calls node 0! Can you please post the output of "ctdb status" when all nodes are up and running? I'm guessing that your nodes file looks like:

  10.231.8.67
  10.231.8.65
  10.231.8.66

This:

> node1 repeatedly logs:
> 2018/09/04 04:35:06.414369 ctdbd[10129]: Recovery has started
> 2018/09/04 04:35:06.414944 ctdbd[10129]: connect() failed, errno=111
> 2018/09/04 04:35:06.415076 ctdbd[10129]: Unable to run startrecovery event

is due to this:

> 2018/09/04 04:29:55.570212 ctdb-eventd[10131]: Bad talloc magic value - wrong talloc version used/mixed
> 2018/09/04 04:29:57.240533 ctdbd[10129]: Eventd went away

We have fixed a similar issue in some versions. When we know what version you are running then we can say whether it is a known issue or a new issue.

I have been working on the following issue for most of this week:

> 2018/09/04 04:29:52.465663 ctdbd[10129]: This node (1) is now the recovery master
> 2018/09/04 04:29:55.468771 ctdb-recoverd[11302]: Election period ended
> 2018/09/04 04:29:55.469404 ctdb-recoverd[11302]: Node 2 has changed flags - now 0x8 was 0x0
> 2018/09/04 04:29:55.469475 ctdb-recoverd[11302]: Remote node 2 had flags 0x8, local had 0x0 - updating local
> 2018/09/04 04:29:55.469514 ctdb-recoverd[11302]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
> 2018/09/04 04:29:55.469525 ctdb-recoverd[11302]: Attempting to take recovery lock (/share-fs/export/ctdb/.ctdb/reclock)
> 2018/09/04 04:29:55.563522 ctdb-recoverd[11302]: Unable to take recovery lock - contention
> 2018/09/04 04:29:55.563573 ctdb-recoverd[11302]: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
> 2018/09/04 04:29:55.563585 ctdb-recoverd[11302]: Banning node 1 for 300 seconds

Are you able to recreate this every time? Sometimes? Rarely? I hadn't seen this until recently and I'm now worried that it is more widespread than we realise.

Thanks...

peace & happiness,
martin
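The "Unable to take recovery lock - contention" messages quoted above mean that the recovery daemon could not obtain an exclusive lock on the reclock file in the cluster filesystem; CTDB normally takes this lock with fcntl() byte-range locking on that file. The sketch below uses a placeholder lock path rather than the reclock path from the logs and shows, under that fcntl-locking assumption, how a second process sees contention while another process still holds the lock:

/* Sketch of fcntl() lock contention on a shared lock file.  The path is
 * a placeholder, not the reclock file from the logs.  If another process
 * already holds the write lock, F_SETLK fails with EACCES or EAGAIN,
 * i.e. "contention". */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/example-reclock";   /* placeholder */
    struct flock fl;
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;      /* exclusive write lock           */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;             /* lock a single byte of the file */

    if (fcntl(fd, F_SETLK, &fl) == -1) {
        /* Another process already holds the lock: contention. */
        printf("unable to take lock: errno=%d (%s)\n", errno, strerror(errno));
        return 1;
    }

    printf("lock taken, holding for 60 seconds...\n");
    sleep(60);                /* keep the lock so a second run contends */
    close(fd);                /* closing the fd releases the lock       */
    return 0;
}

Running two copies of this at the same time makes the second one fail the F_SETLK call, which is roughly the situation a node is in when it logs "Unable to take recovery lock - contention" and then bans itself.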