Scott Mcdermott
2011-May-20 03:25 UTC
hang in select() on unix domain sockets, 60s timeout loop
I have rsync 3.0.8 on both ends, running over ssh. The process on the
remote server appears to be hung in select(). It has fds 0, 1 and 2
open, and all of them are unix domain sockets. It just keeps timing out
every 60 seconds and then calls select() again, and it has been stuck
like this for over 15 hours.

The flags on the remote are:

    --server --sender -lHogDtpAXrRe.iLs --numeric-ids --inplace

It loops every 60 seconds with:

    select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)

The listed readfd is a unix domain socket:

    $ sudo readlink /proc/`pgrep rsync`/fd/0
    socket:[62052357]
    $ sudo lsof | grep 62052357
    rsync 4532 root 0u unix 0xe68aa040 0t0 62052357 socket
    rsync 4532 root 1u unix 0xe68aa040 0t0 62052357 socket
    $ grep 62052357 /proc/net/unix
    e68aa040: 00000003 00000000 00000000 0001 03 62052357

So it's the same process. Is it hung on itself? How come it doesn't
act on the timeout and just goes around the loop again? Is it waiting
for a signal? Can I send it one to unstick it? There don't appear to be
any other fds of interest in the select loop, so I'm not sure what
other event it could be waiting on besides a signal.

I did some searching and found some references to a Cygwin issue, and
also to an old issue with non-blocking file descriptors in ssh that
appears to be fixed. However, I don't see how ssh could be part of the
picture here, since rsync is waiting on itself and nothing else seems
to be involved. Unless we are waiting for SIGCHLD? But rsync has no
children in this case, and only one other open fd (another unix domain
socket on fd 2, this time with nobody on the other end, by the looks of
it).

This has happened a few times now (for our backups), but it does not
happen every time. I'm a little confused. I can add '--timeout', but
I'd really prefer to know why it's doing this, and to be able to
distinguish a real timeout error from an rsync (or libc?) bug.

-- Scott
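
For anyone trying to interpret output like the above, here is a minimal Python sketch of two general socket behaviours that may be relevant. It is not taken from rsync or ssh, it assumes Linux, and the function name inspect_socketpair and its poll_seconds parameter are made up for illustration. It shows that a dup()ed descriptor reports the same socket inode as the original (much as fd 0 and fd 1 both show 62052357), that the peer end of a connected pair has a different inode (so a grep for one inode would not be expected to match the peer), and that select() on an idle socket returns with nothing ready once its timeout expires.

```python
import os
import select
import socket


def inspect_socketpair(poll_seconds=0.1):
    """Return a few facts about a connected AF_UNIX pair (illustrative only)."""
    near, far = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    dup_fd = os.dup(near.fileno())  # analogous to fd 1 being a dup of fd 0
    try:
        near_inode = os.fstat(near.fileno()).st_ino
        dup_inode = os.fstat(dup_fd).st_ino
        far_inode = os.fstat(far.fileno()).st_ino

        # Peer is silent: select() reports no ready fds when the timeout expires.
        idle_ready, _, _ = select.select([near], [], [], poll_seconds)

        # Peer writes one byte: the same select() call now reports readability.
        far.send(b"x")
        busy_ready, _, _ = select.select([near], [], [], poll_seconds)

        return {
            "dup_shares_inode": near_inode == dup_inode,
            "peer_has_other_inode": near_inode != far_inode,
            "idle_timed_out": idle_ready == [],
            "readable_after_send": busy_ready == [near],
        }
    finally:
        os.close(dup_fd)
        near.close()
        far.close()


if __name__ == "__main__":
    print(inspect_socketpair())
```

If the same holds for the hung process, the repeating 60 second timeout would be consistent with the peer end (whoever holds it) simply never sending anything, though that is an assumption and not something the output above proves.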