Apollon Oikonomopoulos
2017-Apr-05 11:49 UTC
Transfer hangs, both sides waiting on receive
Hi, We're encountering a rather weird, sporadic hang during a large-file transfer. A bit of background on the setup: - Host A rsyncs a MySQL database from Host B. Both sides use rsync 3.1.1 on Debian Jessie. - Host B serves the files using rsync --daemon from an ext4 filesystem residing on a thin LVM snapshot. - Host A uses the following command to rsync the files to a btrfs subvolume: rsync -og -avPH --log-file=/var/log/mysql-rsync.log --debug=ALL --chown=user:user --delete --inplace rsync://hostb.example.com/mysql/ /media/btrfs/db-snapshot-$$/mysql (I know parts of the command (-og + -a + --chown) don't make much sense :) - Both hosts sit on the same LAN, connected to a Juniper EX4200 stack with active-backup interfaces. MTU is standard 1500 and there are no known network issues. Host A has Intel I350 (igb) NICs, while Host B has BCM5709 (bnx2) NICs. Both hosts run Linux 3.16.36. Once every few (~3-5) transfers, the rsync process hangs with both sides waiting on select() on the TCP socket. From the transfer logs I can see that the hang always happens when about to receive the same 32GB file, already a couple of GBs into the total transfer. Host A reports: .... 04:41:03 [6281] [receiver] send_msg_int(100, 1701) 04:41:03 [6281] recv_files(mysql/innodb_table_stats.ibd) 04:41:03 [6281] sender finished //mysql/innodb_table_stats.ibd 04:41:03 [6281] send_files(1749, //mysql/user.MYD) 04:41:03 [6281] sender finished //mysql/user.MYD 04:41:03 [6281] send_files(1750, //mysql/user.MYI) 04:41:03 [6281] sender finished //mysql/user.MYI 04:41:03 [6281] send_files(1806, //some_dbase) 04:41:03 [6281] send_files(1808, //some_dbase/actions.ibd) 04:41:03 [6281] >f.st...... mysql/innodb_table_stats.ibd 04:41:03 [6281] finishing mysql/innodb_table_stats.ibd 04:41:03 [6281] set modtime of mysql/innodb_table_stats.ibd to (1491343251) Wed Apr 5 01:00:51 2017 04:41:03 [6281] [receiver] send_msg_int(100, 1703) 04:41:03 [6281] recv_files(mysql/user.MYD) 04:41:03 [6281] >f.st...... mysql/user.MYD 04:41:03 [6281] finishing mysql/user.MYD 04:41:03 [6281] set modtime of mysql/user.MYD to (1487686529) Tue Feb 21 16:15:29 2017 04:41:03 [6281] [receiver] send_msg_int(100, 1749) 04:41:03 [6281] recv_files(mysql/user.MYI) 04:41:03 [6281] >f..t...... mysql/user.MYI 04:41:03 [6281] finishing mysql/user.MYI 04:41:03 [6281] set modtime of mysql/user.MYI to (1487689840) Tue Feb 21 17:10:40 2017 04:41:03 [6281] [receiver] send_msg_int(100, 1750) 04:41:03 [6281] recv_files(some_dbase) 04:41:03 [6281] .d..t...... some_dbase/ <silence> while Host B reports: .... 01:41:03 [3726] send_files(1749, //mysql/user.MYD) 01:41:03 [3726] <f.st...... mysql/user.MYD 01:41:03 [3726] sender finished //mysql/user.MYD 01:41:03 [3726] send_files(1750, //mysql/user.MYI) 01:41:03 [3726] <f..t...... mysql/user.MYI 01:41:03 [3726] sender finished //mysql/user.MYI 01:41:03 [3726] send_files(1806, //some_dbase) 01:41:03 [3726] .d..t...... some_dbase/ 01:41:03 [3726] send_files(1808, //some_dbase/actions.ibd) <silence> The backtraces of the hung processes are: Host A (receiver): #0 0x00007f87d7da3873 in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f87d88dd7a2 in perform_io (needed=needed at entry=4, flags=flags at entry=17) at io.c:742 #2 0x00007f87d88deef7 in raw_read_int () at io.c:908 #3 0x00007f87d88df683 in read_a_msg () at io.c:1423 #4 0x00007f87d88e0676 in read_buf (f=f at entry=4, buf=buf at entry=0x7ffd814c0df0 "\300\037L\201\375\177", len=len at entry=1) at io.c:1835 #5 0x00007f87d88e157b in read_ndx (f=f at entry=4) at io.c:2222 #6 0x00007f87d88be377 in read_ndx_and_attrs (f_in=4, f_out=6, iflag_ptr=0x7ffd814c0f28, type_ptr=0x7ffd814c0f27 "\200\b", buf=0x7ffd814c1fc0 "", len_ptr=0x7ffd814c0f2c) at rsync.c:320 #7 0x00007f87d88c5b5e in recv_files (f_in=5, f_out=-2125728480, local_name=0x7ffd814c0fc0 "some_dbase") at receiver.c:548 #8 0x00007f87d88d0787 in do_recv (f_in=4, f_out=6, local_name=0x0) at main.c:906 #9 0x00007f87d88d1218 in client_run (f_in=4, f_out=4, pid=-1, argc=<optimized out>, argv=<optimized out>) at main.c:1213 #10 0x00007f87d88b4330 in start_client (argv=<optimized out>, argc=<optimized out>) at main.c:1382 #11 main (argc=2, argv=0x7f87d9000240) at main.c:1651 Host B (sender): #0 0x00007f9a7130a873 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81 #1 0x00007f9a71e447a2 in perform_io (needed=needed at entry=4, flags=flags at entry=17) at io.c:742 #2 0x00007f9a71e45e2e in raw_read_buf (buf=0x7ffe7c1aa380 "", len=4) at io.c:895 #3 0x00007f9a71e47643 in read_buf (f=<optimized out>, buf=buf at entry=0x7ffe7c1aa380 "", len=len at entry=4) at io.c:1840 #4 0x00007f9a71e47851 in read_int (f=<optimized out>) at io.c:1711 #5 0x00007f9a71e2f652 in receive_sums (f=<optimized out>) at sender.c:96 #6 send_files (f_in=4, f_out=1930300096) at sender.c:332 #7 0x00007f9a71e37f6c in do_server_sender (argv=<optimized out>, argc=<optimized out>, f_out=3, f_in=3) at main.c:840 #8 start_server (f_in=3, f_out=3, argc=2, argv=0x7f9a7302f240) at main.c:1107 #9 0x00007f9a71e5a34e in rsync_module (f_in=3, f_out=3, i=0, addr=0x7f9a72098518 <numeric_ids> "\377\377\377\377", host=0x7f9a72098c80 <name_buf> "host-a.example.com") at clientserver.c:1009 #10 0x00007f9a71e5b260 in start_daemon (f_in=3, f_out=3) at clientserver.c:1100 #11 0x00007f9a71e4f5d1 in start_accept_loop (port=4, fn=0x7ffe7c1aa160, fn at entry=0x7f9a71e5b120 <start_daemon>) at socket.c:618 #12 0x00007f9a71e5bb46 in daemon_main () at clientserver.c:1202 #13 0x00007f9a71e1aeb1 in main (argc=0, argv=0x0) at main.c:1626 The worrysome part is the Send-Q on Host A, which last time contained 263392 unACK'd bytes. The rest of the queues involved were empty. tcpdump reveals no retransmission attempts, only 2 window updates and a stream of TCP keepalives after the point where the transfer halts: Bytes Time (s) SRC DST Proto Details 12995 1.637946 hostb hosta RSYNC Server MOTD 86 1.637963 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37686543 Win=23993 Len=0 TSval=2255190023 TSecr=100923227 86 1.638406 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37686543 Ack=1 Win=1688 Len=0 TSval=100923227 TSecr=2255190022 1514 1.641767 hostb hosta RSYNC Server MOTD 86 1.641802 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37687971 Win=23985 Len=0 TSval=2255190024 TSecr=100923228 1381 1.641938 hostb hosta RSYNC Server MOTD 86 1.641955 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37689266 Win=23977 Len=0 TSval=2255190024 TSecr=100923228 86 1.647888 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=6840 Len=0 TSval=100923230 TSecr=2255190024 86 121.931963 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255220097 TSecr=100923230 86 121.932062 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100953302 TSecr=2255190024 86 242.248005 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255250176 TSecr=100953302 86 242.248090 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100983381 TSecr=2255190024 86 362.567974 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255280256 TSecr=100983381 86 362.568085 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101013461 TSecr=2255190024 86 482.891958 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255310336 TSecr=101013461 86 482.892026 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101043542 TSecr=2255190024>From the backtraces it appears as though the sender is waiting for thechecksums the client has sent, while the client is waiting for the next file id to come in. I can only assume the deadlock is because of the client's checksums sitting in its Send-Q. The only potentially related part I can find in the traffic captures are a couple of Zero Window responses by hostb, which however are followed by window updates with non-zero sizes. Any ideas or hints what else to look for? Regards, Apollon