Daniel Metcalfe
2015-May-19 12:54 UTC
[Samba] ctdb_client.c control timed out - banning nodes
Hello,

We are using CTDB / Samba to serve a number of Windows users, currently around 1200, on a 4-node CTDB cluster.

CTDB version: ctdb-1.0.114.7-1
Samba version: sernet-samba-4.1.16-10

In recent months we have seen a serious problem whenever one of the CTDB nodes is stopped or disconnected, either manually or as the result of a fault. On some occasions all of the other nodes get banned when a node is stopped; sometimes it is only one or two of them, but it always happens to at least one. In the logs the problem always starts with "control timed out". See below:

2015/05/18 14:24:23.375195 [ 6031]: client/ctdb_client.c:759 control timed out. reqid:129 opcode:43 dstnode:0
2015/05/18 14:24:23.375232 [ 6031]: client/ctdb_client.c:870 ctdb_control_recv failed
2015/05/18 14:24:23.375240 [ 6031]: Async operation failed with state 3, opcode:0
2015/05/18 14:24:23.375257 [ 6031]: Async wait failed - fail_count=1
2015/05/18 14:24:23.375263 [ 6031]: server/ctdb_takeover.c:1475 Async control CTDB_CONTROL_RELEASE_IP failed.

We saw a different pattern yesterday, when smb.conf was edited before the share path had been created and node 0 became unhealthy for a moment.

Node 0:

2015/05/18 14:24:17.694058 [ 5972]: ERROR: samba directory /path/to/directory" not available
2015/05/18 14:24:17.694435 [ 5972]: Node became UNHEALTHY. Ask recovery master 0 to perform ip reallocation
2015/05/18 14:24:23.375195 [ 6031]: client/ctdb_client.c:759 control timed out. reqid:129 opcode:43 dstnode:0
2015/05/18 14:24:23.375232 [ 6031]: client/ctdb_client.c:870 ctdb_control_recv failed
2015/05/18 14:24:23.375240 [ 6031]: Async operation failed with state 3, opcode:0
2015/05/18 14:24:23.375257 [ 6031]: Async wait failed - fail_count=1
2015/05/18 14:24:23.375263 [ 6031]: server/ctdb_takeover.c:1475 Async control CTDB_CONTROL_RELEASE_IP failed
2015/05/18 14:24:23.375272 [ 6031]: Failed to read public ips from remote node 0
2015/05/18 14:24:23.376281 [ 6031]: Inconsistent ip allocation. Trigger reallocation.
2015/05/18 14:24:23.376291 [ 6031]: Node 1 has inconsistent public ip allocation and needs update.
2015/05/18 14:24:23.697069 [ 5972]: killed 4 TCP connections to released IP 155.198.63.147
2015/05/18 14:24:24.534101 [ 5972]: Timed out killing tcp connections for IP 155.198.63.147
2015/05/18 14:24:24.541018 [ 5972]: RTNETLINK answers: Cannot assign requested address
2015/05/18 14:24:24.541916 [ 5972]: Failed to del <ip address> on dev bond2:1
2015/05/18 14:24:24.543983 [ 6031]: Could not find idr:129
2015/05/18 14:24:24.544008 [ 6031]: client/ctdb_client.c:706 reqid 129 not found
2015/05/18 14:24:25.156512 [ 6031]: Trigger takeoverrun
2015/05/18 14:24:25.781969 [ 6031]: Async operation failed with ret=0 res=-1 opcode=0
2015/05/18 14:24:25.782002 [ 6031]: Async operation failed with ret=0 res=-1 opcode=0
2015/05/18 14:24:25.782009 [ 6031]: Async wait failed - fail_count=2
2015/05/18 14:24:25.782015 [ 6031]: server/ctdb_takeover.c:1523 Async control CTDB_CONTROL_TAKEOVER_IP failed

Node 1:

2015/05/18 14:24:24.919614 [29039]: common/system_linux.c:91 failed to open raw socket
2015/05/18 14:24:24.919650 [29039]: server/ctdb_takeover.c:2667 sending of gratious arp on iface 'bond2:1' failed (Too many open files)
2015/05/18 14:24:25.025629 [29039]: Failed to create pipe for vacuum child process.
2015/05/18 14:24:25.052609 [29039]: common/system_linux.c:91 failed to open raw socket
2015/05/18 14:24:25.052634 [29039]: server/ctdb_takeover.c:240 sending of arp failed on iface 'bond2:1' (Too many open files)
2015/05/18 14:24:25.052656 [29039]: common/system_linux.c:342 failed to open raw socket (Too many open files)
2015/05/18 14:24:25.052664 [29039]: server/ctdb_takeover.c:259 Failed to send tcp tickle ack for <ip address>
....repeats for every client
2015/05/18 14:24:38.389204 [29039]: INTERNAL ERROR: Signal 6 in ctdbd pid 29039
2015/05/18 14:24:38.389209 [29039]: Please read the file BUGS.txt in the distribution

Node 2:

2015/05/18 14:24:38.389035 [12968]: Freeze priority 1
2015/05/18 14:24:39.235547 [13021]: ctdb_control error: 'node is disconnected'
2015/05/18 14:24:39.235572 [13021]: ctdb_control error: 'node is disconnected'
2015/05/18 14:24:39.235583 [13021]: Async operation failed with ret=-1 res=-1 opcode=80
2015/05/18 14:24:39.235590 [13021]: Async wait failed - fail_count=1
2015/05/18 14:24:39.235598 [13021]: server/ctdb_recoverd.c:251 Failed to read node capabilities.
2015/05/18 14:24:39.235605 [13021]: server/ctdb_recoverd.c:3025 Unable to update node capabilities.
2015/05/18 14:24:59.405500 [12968]: Freeze priority 1
2015/05/18 14:25:03.257579 [12968]: Could not add client IP 155.198.17.133. This is not a public address.
2015/05/18 14:25:20.427135 [12968]: Banning this node for 300 seconds
2015/05/18 14:25:20.427216 [12968]: Freeze priority 1
2015/05/18 14:25:20.427231 [12968]: Freeze priority 2
2015/05/18 14:25:20.429833 [12968]: Freeze priority 3
2015/05/18 14:25:20.565797 [12968]: Killing TCP connection <ip address>:59613 <ip address>:445
.....repeats killing all TCP connections .....
2015/05/18 14:25:21.331435 [13021]: Node is stopped or banned but recovery mode is not active.
Activate recovery mode and lock databases
2015/05/18 14:25:21.331517 [12968]: Freeze priority 1
2015/05/18 14:25:29.422120 [12968]: Timed out killing tcp connections for IP 155.198.63.145
2015/05/18 14:25:29.617412 [12968]: lib/events/events_standard.c:593 ERROR Time jumped forward by 9 seconds
2015/05/18 14:26:29.712507 [12968]: server/ctdb_recover.c:582 Been in recovery mode for too long. Dropping all IPS
2015/05/18 14:30:20.428137 [12968]: Banning timedout

Node 3:

2015/05/18 14:24:38.389263 [31822]: Freeze priority 1
2015/05/18 14:24:39.245580 [32106]: ctdb_control error: 'node is disconnected'
2015/05/18 14:24:39.245616 [32106]: ctdb_control error: 'node is disconnected'
2015/05/18 14:24:39.245628 [32106]: Async operation failed with ret=-1 res=-1 opcode=80
2015/05/18 14:24:39.245636 [32106]: Async wait failed - fail_count=1
2015/05/18 14:24:39.245644 [32106]: server/ctdb_recoverd.c:251 Failed to read node capabilities.
2015/05/18 14:24:39.245651 [32106]: server/ctdb_recoverd.c:3025 Unable to update node capabilities.
2015/05/18 14:24:59.405728 [31822]: Freeze priority 1
2015/05/18 14:25:29.617999 [31822]: Banning this node for 300 seconds

At this point the samba logs produce:

Node 1:

[2015/05/18 14:25:39.313788, 0] ../source3/lib/ctdbd_conn.c:668(ctdb_handle_message)
  Got cluster reconfigure message

Node 2 (a selection of the errors):

[2015/05/18 14:24:39.234057, 0] ../source3/lib/ctdbd_conn.c:401(ctdb_read_req)
  packet_fd_read failed: NT_STATUS_END_OF_FILE
[2015/05/18 14:24:39.234215, 0] ../source3/lib/ctdbd_conn.c:86(cluster_fatal)
  cluster fatal event: ctdbd died
[2015/05/18 14:24:39.234146, 0] ../source3/lib/ctdbd_conn.c:401(ctdb_read_req)
  - exiting immediately
[2015/05/18 14:24:39.234524, 0] ../source3/lib/ctdbd_conn.c:86(cluster_fatal)
  - exiting immediately
[2015/05/18 14:24:39.235395, 0] ../source3/lib/ctdbd_conn.c:725(ctdbd_socket_handler)
  packet_fd_read failed: NT_STATUS_END_OF_FILE
...the errors above repeat 100s of times

Node 3:

[2015/05/18 14:25:23.071086, 0] ../source3/lib/util.c:477(reinit_after_fork)
  messaging_reinit() failed: NT_STATUS_INTERNAL_DB_ERROR
[2015/05/18 14:30:32.345565, 0] ../source3/lib/ctdbd_conn.c:668(ctdb_handle_message)
  Got cluster reconfigure message

Node 4:

[2015/05/18 14:25:29.768514, 0] ../source3/lib/util.c:477(reinit_after_fork)
  messaging_reinit() failed: NT_STATUS_INTERNAL_DB_ERROR
[2015/05/18 14:30:59.052518, 0] ../source3/lib/ctdbd_conn.c:668(ctdb_handle_message)
  Got cluster reconfigure message

There are further entries in the samba logs, though not directly at the time of the CTDB problem.
There are multiple entries similar to:

[2015/05/18 14:33:41.898756, 0] ../source3/lib/util.c:896(log_stack_trace)
  BACKTRACE: 27 stack frames:
   #0 /usr/lib64/samba/libsmbconf.so.0(log_stack_trace+0x1c) [0x7f404dc429b7]
   #1 /usr/lib64/samba/libsmbconf.so.0(smb_panic_s3+0x55) [0x7f404dc42ab9]
   #2 /usr/lib64/samba/libsamba-util.so.0(smb_panic+0x2d) [0x7f404f78abae]
   #3 /usr/lib64/samba/libtalloc.so.2(+0x23a1) [0x7f404ee673a1]
   #4 /usr/lib64/samba/libtalloc.so.2(+0x41c6) [0x7f404ee691c6]
   #5 /usr/lib64/samba/libtalloc.so.2(_talloc_get_type_abort+0x64) [0x7f404ee6922f]
   #6 /usr/lib64/samba/libsmbconf.so.0(+0x361c6) [0x7f404dc531c6]
   #7 /usr/lib64/samba/libdbwrap.so(dbwrap_parse_record+0x24) [0x7f404a5711d7]
   #8 /usr/lib64/samba/libsmbd_base.so(fetch_share_mode_unlocked+0xa5) [0x7f404f4003e6]
   #9 /usr/lib64/samba/libsmbd_base.so(get_file_infos+0x6e) [0x7f404f3f8765]
   #10 /usr/lib64/samba/libsmbd_base.so(smbd_dirptr_get_entry+0x40f) [0x7f404f327860]
   #11 /usr/lib64/samba/libsmbd_base.so(smbd_dirptr_lanman2_entry+0x1ef) [0x7f404f36176d]
   #12 /usr/lib64/samba/libsmbd_base.so(smbd_smb2_request_process_find+0xd56) [0x7f404f3b5e36]
   #13 /usr/lib64/samba/libsmbd_base.so(smbd_smb2_request_dispatch+0xf89) [0x7f404f3a4d57]
   #14 /usr/lib64/samba/libsmbd_base.so(+0x12b65a) [0x7f404f3a565a]
   #15 /usr/lib64/samba/libsmbd_base.so(+0x12c8af) [0x7f404f3a68af]
   #16 /usr/lib64/samba/libsmbconf.so.0(run_events_poll+0x2c2) [0x7f404dc5f815]
   #17 /usr/lib64/samba/libsmbconf.so.0(+0x42c5d) [0x7f404dc5fc5d]
   #18 /usr/lib64/samba/libtevent.so.0(_tevent_loop_once+0x92) [0x7f404f071407]
   #19 /usr/lib64/samba/libsmbd_base.so(smbd_process+0xf76) [0x7f404f390348]
   #20 /usr/sbin/smbd(+0x8860) [0x7f404fded860]
   #21 /usr/lib64/samba/libsmbconf.so.0(run_events_poll+0x2c2) [0x7f404dc5f815]
   #22 /usr/lib64/samba/libsmbconf.so.0(+0x42c5d) [0x7f404dc5fc5d]
   #23 /usr/lib64/samba/libtevent.so.0(_tevent_loop_once+0x92) [0x7f404f071407]
   #24 /usr/sbin/smbd(main+0x17c9) [0x7f404fdef2ee]
   #25 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f404c6dfd5d]
   #26 /usr/sbin/smbd(+0x58b9) [0x7f404fdea8b9]

There are also 1000s of entries similar to:

[2015/05/18 14:09:45.561245, 0] ../source3/modules/vfs_gpfs.c:1799(vfs_gpfs_get_quotas)
  Get fset id failed path </path/to/directory/a/b/c>, dir </path/to/directory/a/b/c>, errno 13.

I'm wondering whether we have multiple problems going on at the same time, or whether they are all related to some degree. Could there be a bug in CTDB? Any pointers greatly appreciated.

Regards,
Danny Metcalfe
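For what it's worth, the "Too many open files" errors on node 1 suggest ctdbd may be hitting its per-process file-descriptor limit. A minimal sketch of how one might check this on a node, assuming Linux /proc and the standard ctdb CLI (the `pid=$$` line is just a stand-in; substitute the real ctdbd PID, e.g. from `pidof ctdbd`):

```shell
# Stand-in PID: replace with the real ctdbd PID on the affected node.
pid=$$

# Per-process file-descriptor limit vs. current usage.
grep 'Max open files' "/proc/$pid/limits"
ls "/proc/$pid/fd" | wc -l

# If the ctdb CLI is installed, the node states and the ban/timeout
# tunables are worth a look too.
if command -v ctdb >/dev/null 2>&1; then
    ctdb status
    ctdb listvars | grep -iE 'timeout|ban'
fi
```

If the descriptor count is close to the limit, raising `nofile` for the ctdbd service (for example via /etc/security/limits.conf or a `ulimit -n` in the init script) would be the obvious first step; the exact mechanism depends on the distribution.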