On Wed, 2016-10-12 at 08:36 +0200, Paul Slootman wrote:> As always it's best to first upgrade to the current version (3.1.3) > if at all possible, as there's always the chance that the cause of > your problems has already been fixed.Good call, but I believe I may have ruled this out. I didn't upgrade to 3.1.3, but both sides are running 3.1.1 protocol version 31 now. Same problem. I think the key insight was in the strace log which showed the select() call was timed out. If I knew what type of file descriptor it was being fed, I might have a clue. It might have been a socket or something on disk. I don't know. -- Kip Warner -- Senior Software Engineer OpenPGP encrypted/signed mail preferred http://www.thevertigo.com -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 163 bytes Desc: This is a digitally signed message part URL: <http://lists.samba.org/pipermail/rsync/attachments/20161012/54739acc/signature.sig>
On Wed 12 Oct 2016, Kip Warner wrote:> > I think the key insight was in the strace log which showed the select() > call was timed out. If I knew what type of file descriptor it was being > fed, I might have a clue. It might have been a socket or something on > disk. I don't know.You can use lsof -p $pid to show what files that process has opened. On linux you can also use 'ls -l /proc/$pid/fd'. Paul
On Wed, Oct 12, 2016 at 8:30 PM, Kip Warner <kip at thevertigo.com> wrote:> I think the key insight was in the strace log which showed the > select() call was timed out.No, that's totally expected. While select() is waiting for I/O to arrive, it returns to rsync every 60 seconds to allow it to decide if it wants to continue waiting or do something else. You have to find the process that is exiting first to discover why the connection is closing. This assumes that it's not a network issue, where all of the programs get a closed-connection error. ..wayne.. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.samba.org/pipermail/rsync/attachments/20161013/4815e20e/attachment.html>
I don't remember whether or not you said you were running rsync over ssh but if you are you can also debug the ssh layer. You can even do it at both ends.... On the server run a debugging sshd on an alternate port with: /usr/sbin/sshd -dDp222 (note that this will only accept 1 connection, debug to the terminal, then exit) then on the client use rsync -e "ssh -vp222" to go verbose and use the debugging port. On 10/13/2016 08:05 PM, Kip Warner wrote:> On Thu, 2016-10-13 at 16:58 -0700, Wayne Davison wrote: >> No, that's totally expected. While select() is waiting for I/O to >> arrive, it returns to rsync every 60 seconds to allow it to decide if >> it wants to continue waiting or do something else. You have to find >> the process that is exiting first to discover why the connection is >> closing. This assumes that it's not a network issue, where all of the >> programs get a closed-connection >> error. > > Hey Wayne, > > I took Paul's advice and tried monitoring with lsof. This is what I > saw: > > $ lsof -p 3104 > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > rsync 3104 kip cwd DIR 0,19 4096 214695958 /home/kip/Disk_Backups/kip-desktop/yakkety > rsync 3104 kip rtd DIR 179,2 4096 2 / > rsync 3104 kip txt REG 179,2 460584 7817 /usr/bin/rsync > rsync 3104 kip mem REG 179,2 42692 3974 /lib/arm-linux-gnueabihf/libnss_files-2.13.so > rsync 3104 kip mem REG 179,2 71624 3971 /lib/arm-linux-gnueabihf/libnsl-2.13.so > rsync 3104 kip mem REG 179,2 38608 3976 /lib/arm-linux-gnueabihf/libnss_nis-2.13.so > rsync 3104 kip mem REG 179,2 26484 3972 /lib/arm-linux-gnueabihf/libnss_compat-2.13.so > rsync 3104 kip mem REG 179,2 2953776 18967 /usr/lib/locale/locale-archive > rsync 3104 kip mem REG 179,2 1216624 3965 /lib/arm-linux-gnueabihf/libc-2.13.so > rsync 3104 kip mem REG 179,2 130448 41964 /lib/arm-linux-gnueabihf/libgcc_s.so.1 > rsync 3104 kip mem REG 179,2 43024 10745 /lib/arm-linux-gnueabihf/libpopt.so.0.0.0 > rsync 3104 kip mem REG 179,2 26240 1840 /lib/arm-linux-gnueabihf/libacl.so.1.1.0 > rsync 3104 kip mem REG 179,2 17904 1854 /lib/arm-linux-gnueabihf/libattr.so.1.1.0 > rsync 3104 kip mem REG 179,2 10170 28719 /usr/lib/arm-linux-gnueabihf/libcofi_rpi.so > rsync 3104 kip mem REG 179,2 126236 3961 /lib/arm-linux-gnueabihf/ld-2.13.so > rsync 3104 kip 1w FIFO 0,10 0t0 8455 pipe > rsync 3104 kip 2w FIFO 0,10 0t0 8456 pipe > rsync 3104 kip 3u unix 0xdbda86c0 0t0 8466 socket > > And on the strace... > > ... > 3104 select(4, [3], [1], [3], {60, 0}) = 1 (out [1], left {59, 999978}) > 3104 write(1, "h\340\0\7\0y@\312\227\241\236\255\367\312\3637\323X\206\314\250\372\362P1\202\374\"i\265c\16"..., 57452) = 57452 > 3104 select(4, [3], [], [3], {60, 0}) = 0 (Timeout) > (repeats above line until client (sender) reports connection timed out) > > Sender reports the following until it times out... > > ... > match at 32949141504 last_match=32949141504 j=251382 len=131072 n=0 > 32.95G 96% 83.55MB/s 0:00:12 > > >-- ~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._., Kevin Korb Phone: (407) 252-6853 Systems Administrator Internet: FutureQuest, Inc. Kevin at FutureQuest.net (work) Orlando, Florida kmk at sanitarium.net (personal) Web page: http://www.sanitarium.net/ PGP public key available on web site. ~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._., -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 181 bytes Desc: OpenPGP digital signature URL: <http://lists.samba.org/pipermail/rsync/attachments/20161013/ccb445f5/signature.sig>
On Thu, 2016-10-13 at 20:08 -0400, Kevin Korb wrote:> I don't remember whether or not you said you were running rsync over > ssh but if you are you can also debug the ssh layer. You can even do > it at both ends....Yes, I'm doing this over an SSH tunnel.> On the server run a debugging sshd on an alternate port with: > /usr/sbin/sshd -dDp222 > (note that this will only accept 1 connection, debug to the terminal, > then exit) then on the client use rsync -e "ssh -vp222" to go verbose > and use the debugging port.This is a great idea except that I only have one port forwarded to the server (beyond my control) and that port is for the current SSH session I would need to spawn another one on some other inaccessible port. -- Kip Warner -- Senior Software Engineer OpenPGP encrypted/signed mail preferred http://www.thevertigo.com -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 163 bytes Desc: This is a digitally signed message part URL: <http://lists.samba.org/pipermail/rsync/attachments/20161015/510ec343/signature.sig>
On Thu, 2016-10-13 at 20:08 -0400, Kevin Korb wrote:> I don't remember whether or not you said you were running rsync over > ssh but if you are you can also debug the ssh layer. You can even do > it at both ends.... > > On the server run a debugging sshd on an alternate port with: > /usr/sbin/sshd -dDp222 > (note that this will only accept 1 connection, debug to the terminal, > then exit) then on the client use rsync -e "ssh -vp222" to go verbose > and use the debugging port.Hey Kevin, I managed to get a chance to try your suggestion. I noticed something very interesting. On the server side, there's actually two processes of rsync that appear to be spawned by the client's connection. Running the ssh daemon on the server side, I notice that long after the server dies like so... ... debug1: session_by_channel: session 0 channel 0 debug1: session_input_channel_req: session 0 req exec Read error from remote host <snip>: Connection reset by peer debug1: do_cleanup debug1: do_cleanup debug1: PAM: cleanup debug1: PAM: closing session Sessions still open, not unmounting debug1: PAM: deleting credentials ...and the client rsync dies... match at 32949010432 last_match=32949010432 j=251381 len=131072 n=0 match at 32949141504 last_match=32949141504 j=251382 len=131072 n=0 32.95G 96% 74.13MB/s 0:00:14 packet_write_wait: Connection to <snip> port 22223: Broken pipe rsync: [sender] write error: Broken pipe (32) rsync error: unexplained error (code 255) at io.c(820) [sender=3.1.1] [sender] _exit_cleanup(code=10, file=io.c, line=820): about to call exit(255) ...one of the server side rsync processes according to strace is still alive... write(1, "h\340\0\7\310\376\\\233\227\241\236\255VC\31\351\323X\206\314F$\337\0321\202\374\"\320(d\271"..., 57452) = 57452 select(4, [3], [], [3], {60, 0}) = 0 (Timeout) (last line repeats a couple hundred times) ...and apparently the second rsync server side process is actually still performing reads and writes more than an hour after the connection died... ... read(1, "\337\363\356\1^\26L\316\17\31izD\254\27\346\267\266H\343\223\v\357\252d'h\351\371\0ny"..., 262144) = 262144 write(3, "\337\363\356\1^\26L\316\17\31izD\254\27\346\267\266H\343\223\v\357\252d'h\351\371\0ny"..., 262144) = 262144 ... This makes me wonder if this has something to do with memory buffers for file i/o being exhausted on the server and this server side activity long after the connection died is an effort to flush a file buffer to disk. The file this always breaks on is 30+GB, but the server hardware is a small embedded system. -- Kip Warner -- Senior Software Engineer OpenPGP encrypted/signed mail preferred http://www.thevertigo.com -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 163 bytes Desc: This is a digitally signed message part URL: <http://lists.samba.org/pipermail/rsync/attachments/20161015/3435fbd7/signature.sig>