samba-bugs at samba.org
2022-Sep-05  13:51 UTC
[Bug 15163] New: rsync timeout non-effective
https://bugzilla.samba.org/show_bug.cgi?id=15163
            Bug ID: 15163
           Summary: rsync timeout non-effective
           Product: rsync
           Version: 3.2.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: core
          Assignee: wayne at opencoder.net
          Reporter: devzero at web.de
        QA Contact: rsync-qa at samba.org
  Target Milestone: ---
there seem to be circumstances, where timeout param is not effective
if i do
/root/rsync/rsync-3.2.5/rsync -avi --timeout=10 --exclude='/proc'
--exclude='/dev/'  root at 172.20.37.189:/
/zfspool/backup/172.20.37.189/
it hangs forever , as soon as rsync tries to traverse some suspended/hanging
/iscsipool on the remote machine (as this results in locked process as it's
in
non-interruptible syscall)
if i do
/root/rsync/rsync-3.2.5/rsync -avi --timeout=10 --exclude='/proc'
--exclude='/dev/'  root at 172.20.37.189:/iscsipool
/zfspool/backup/172.20.37.189/
i.e. if i try to rsync the "hung/stalled" directory directly,  i'm
getting
proper timeout behaviour:
receiving incremental file list
[Receiver] io timeout after 10 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(197)
[Receiver=3.2.5]
<proper exit and return to shell>
so, apparently there seem to be circumstances where the timeout detection is
not effective
i tried to have a look with strace but don't get a clue what's the
problem, as
in both cases rsync gets stuck in /iscsipool
if i run rsync locally on the machine containing the hung/stalled dir, it
behaves like this:
# rsync -av --timeout=1 --dry-run /iscsipool /tmp
sending incremental file list
[Receiver] io timeout after 1 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(197)
[Receiver=3.2.3]
<here it hangs forever, never returns to shell , cannot even ^C or ^Z>
i have seen rsync hang in my scripted backup solution on other situations, e.g.
stale nfs mount etc and since i have a repro-case now, i decided to open a
bugticket for it 
also see https://www.mail-archive.com/rsync at lists.samba.org/msg33555.html
-- 
You are receiving this mail because:
You are the QA Contact for the bug.
https://bugzilla.samba.org/show_bug.cgi?id=15163
--- Comment #1 from roland <devzero at web.de> ---
here is some strace from the backup host to show the difference
if i set timeout=60 , rsyncing "root at 172.20.37.189:/" hangs
forever:
08:25:30 select(4, [3], [], [3], {tv_sec=30, tv_usec=0}) = 1 (in [3], left
{tv_sec=0, tv_usec=316390})
08:26:00 read(3, "\0\0\0\7", 32768)     = 4
08:26:00 select(5, [3], [4], [3], {tv_sec=30, tv_usec=0}) = 1 (out [4], left
{tv_sec=29, tv_usec=999994})
08:26:00 write(4, "\0\0\0\7", 4)        = 4
 | 00000  00 00 00 07                                       ....             |
08:26:00 select(4, [3], [], [3], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
08:26:30 select(5, [3], [4], [3], {tv_sec=30, tv_usec=0}) = 1 (out [4], left
{tv_sec=29, tv_usec=999995})
08:26:30 write(4, "\0\0\0\7", 4)        = 4
 | 00000  00 00 00 07                                       ....             |
08:26:30 select(4, [3], [], [3], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
08:27:00 select(5, [3], [4], [3], {tv_sec=30, tv_usec=0}) = 1 (out [4], left
{tv_sec=29, tv_usec=999994})
08:27:00 write(4, "\0\0\0\7", 4)        = 4
 | 00000  00 00 00 07                                       ....             |
08:27:00 select(4, [3], [], [3], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
08:27:30 select(5, [3], [4], [3], {tv_sec=30, tv_usec=0}) = 1 (out [4], left
{tv_sec=29, tv_usec=999994})
08:27:30 write(4, "\0\0\0\7", 4)        = 4
 | 00000  00 00 00 07                                       ....             |
08:27:30 select(4, [3], [], [3], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
08:28:00 select(5, [3], [4], [3], {tv_sec=30, tv_usec=0}) = 1 (out [4], left
{tv_sec=29, tv_usec=999994})
08:28:00 write(4, "\0\0\0\7", 4)        = 4
 | 00000  00 00 00 07                                       ....             |
08:28:00 select(4, [3], [], [3], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
08:28:30 select(5, [3], [4], [3], {tv_sec=30, tv_usec=0}) = 1 (out [4], left
{tv_sec=29, tv_usec=999994})
08:28:30 write(4, "\0\0\0\7", 4)        = 4
 | 00000  00 00 00 07                                       ....             |
08:28:30 select(4, [3], [], [3], {tv_sec=30, tv_usec=0}
<repeats forever>
mind that tv_sec=30 is half of the timeout. if i set timeout=10 , tv_sec is 5. 
rsyncing only /iscsipool also hangs, but times out .   for my curiosity setting
60s timeout leads to 90s timeout, but not 60s. apparently, half of the timeout
is being added to the real timeout.
# time /root/rsync/rsync-3.2.5/rsync -avi --timeout=10 --exclude='/proc'
--exclude='/dev/' --exclude='/sys'  root at
172.20.37.189:/iscsipool
/zfspool/backup/172.20.37.189/backup
receiving incremental file list
[Receiver] io timeout after 10 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(197)
[Receiver=3.2.5]
real    0m15.356s  <- !!!
user    0m0.001s
sys     0m0.005s
# time /root/rsync/rsync-3.2.5/rsync -avi --timeout=60 --exclude='/proc'
--exclude='/dev/' --exclude='/sys'  root at
172.20.37.189:/iscsipool
/zfspool/backup/172.20.37.189/backup
receiving incremental file list
[Receiver] io timeout after 60 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(197)
[Receiver=3.2.5]
real    1m30.440s  <-!!!
user    0m0.002s
sys     0m0.003s
 | 00000  72 65 63 65 69 76 69 6e  67 20 69 6e 63 72 65 6d  receiving increm |
 | 00010  65 6e 74 61 6c 20 66 69  6c 65 20 6c 69 73 74 0a  ental file list. |
08:35:49 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x7f74e5536000
08:35:49 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x7f74e54f5000
08:35:49 select(6, [5], [4], [5], {tv_sec=30, tv_usec=0}) = 1 (out [4], left
{tv_sec=29, tv_usec=999995})
08:35:49 write(4, "$\0\0\7\7\0\0\0- /proc\7\0\0\0- /dev/\6\0\0\0-
"..., 40) 40
 | 00000  24 00 00 07 07 00 00 00  2d 20 2f 70 72 6f 63 07  $.......- /proc. |
 | 00010  00 00 00 2d 20 2f 64 65  76 2f 06 00 00 00 2d 20  ...- /dev/....-  |
 | 00020  2f 73 79 73 00 00 00 00                           /sys....         |
08:35:49 select(6, [5], [], [5], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
08:36:19 select(6, [5], [], [5], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
08:36:49 select(6, [5], [], [5], {tv_sec=30, tv_usec=0}) = 0 (Timeout)
08:37:19 write(2, "[Receiver] io timeout after 60 s"..., 50[Receiver]
io
timeout after 60 seconds -- exiting
) = 50
 | 00000  5b 52 65 63 65 69 76 65  72 5d 20 69 6f 20 74 69  [Receiver] io ti |
 | 00010  6d 65 6f 75 74 20 61 66  74 65 72 20 36 30 20 73  meout after 60 s |
 | 00020  65 63 6f 6e 64 73 20 2d  2d 20 65 78 69 74 69 6e  econds -- exitin |
 | 00030  67 0a                                             g.               |
08:37:19 rt_sigaction(SIGUSR1, {sa_handler=SIG_IGN, sa_mask=[],
sa_flags=SA_RESTORER, sa_restorer=0x7f74e4965400}, NULL, 8) = 0
08:37:19 rt_sigaction(SIGUSR2, {sa_handler=SIG_IGN, sa_mask=[],
sa_flags=SA_RESTORER, sa_restorer=0x7f74e4965400}, NULL, 8) = 0
08:37:19 wait4(27237, 0x7ffda918da9c, WNOHANG, NULL) = 0
08:37:19 getpid()                       = 27217
08:37:19 kill(27237, SIGUSR1)           = 0
08:37:19 write(2, "rsync error: timeout in data sen"..., 82rsync
error: timeout
in data send/receive (code 30) at io.c(197) [Receiver=3.2.5]
) = 82
 | 00000  72 73 79 6e 63 20 65 72  72 6f 72 3a 20 74 69 6d  rsync error: tim |
 | 00010  65 6f 75 74 20 69 6e 20  64 61 74 61 20 73 65 6e  eout in data sen |
 | 00020  64 2f 72 65 63 65 69 76  65 20 28 63 6f 64 65 20  d/receive (code  |
 | 00030  33 30 29 20 61 74 20 69  6f 2e 63 28 31 39 37 29  30) at io.c(197) |
 | 00040  20 5b 52 65 63 65 69 76  65 72 3d 33 2e 32 2e 35   [Receiver=3.2.5 |
 | 00050  5d 0a                                             ].               |
08:37:19 exit_group(30)                 = ?
08:37:19 +++ exited with 30 +++
-- 
You are receiving this mail because:
You are the QA Contact for the bug.
samba-bugs at samba.org
2022-Sep-11  07:51 UTC
[Bug 15163] rsync timeout non-effective and incorrect behaviour
https://bugzilla.samba.org/show_bug.cgi?id=15163 --- Comment #2 from roland <devzero at web.de> --- here is another bugreport, where timeout is not effective/working https://bugzilla.redhat.com/show_bug.cgi?id=1944132 -- You are receiving this mail because: You are the QA Contact for the bug.
samba-bugs at samba.org
2022-Sep-11  08:00 UTC
[Bug 15163] rsync timeout non-effective and incorrect behaviour
https://bugzilla.samba.org/show_bug.cgi?id=15163
--- Comment #3 from roland <devzero at web.de> ---
ah, getting a clue
in io.c
static void check_timeout(BOOL allow_keepalive, int keepalive_flags)
{
        time_t t, chk;
        /* On the receiving side, the generator is now the one that decides
         * when a timeout has occurred.  When it is sifting through a lot of
         * files looking for work, it will be sending keep-alive messages to
         * the sender, and even though the receiver won't be
sending/receiving
         * anything (not even keep-alive messages), the successful writes to
         * the sender will keep things going.  If the receiver is actively
         * receiving data, it will ensure that the generator knows that it is
         * not idle by sending the generator keep-alive messages (since the
         * generator might be blocked trying to send checksums, it needs to
         * know that the receiver is active).  Thus, as long as one or the
         * other is successfully doing work, the generator will not timeout. */
        if (!io_timeout)
                return;
        t = time(NULL);
        if (allow_keepalive) {
                /* This may put data into iobuf.msg w/o flushing. */
                maybe_send_keepalive(t, keepalive_flags);
        }
        if (!last_io_in)
                last_io_in = t;
        if (am_receiver)
                return;
        chk = MAX(last_io_out, last_io_in);
        if (t - chk >= io_timeout) {
                if (am_server)
                        msgs2stderr = 1;
                rprintf(FERROR, "[%s] io timeout after %d seconds --
exiting\n",
                        who_am_i(), (int)(t-chk));
                exit_cleanup(RERR_TIMEOUT);
        }
}
so , apparently when syncing "/" instead of "/iscsipool",
keepalive messages
come into the play....
i see keepalive messages getting sent, but the are NOT replied by the remote
rsync machine, as strace shows. there is only sshd process actively processing
data.
 | 00000  00 00 00 07                                       ....             |
[pid 131555] 09:57:47.454101 select(14, [4 5 11 13], [], NULL, NULL) = 1 (in
[4])
[pid 131555] 09:57:50.455771 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 131555] 09:57:50.455887 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 131555] 09:57:50.455963 read(4, "\0\0\0
\v\10\372\371(\256V\16$\17\375Z5\252Km-\357\311~A\352\202\232\365\234\33>"...,
16384) = 44
[pid 131555] 09:57:50.456064 select(14, [4 5 11 13], [10], NULL, NULL) = 1 (out
[10])
[pid 131555] 09:57:50.456225 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 131555] 09:57:50.456745 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 131555] 09:57:50.456950 write(10, "\0\0\0\7", 4) = 4
 | 00000  00 00 00 07                                       ....             |
[pid 131555] 09:57:50.457163 select(14, [4 5 11 13], [], NULL, NULL) = 1 (in
[4])
[pid 131555] 09:57:53.459934 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 131555] 09:57:53.460391 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 131555] 09:57:53.460602 read(4, "\0\0\0
\303M\204S\275D\23r\225a\210\247Q\314\256\373]\\C\27\246\241\265\212\235\210\312o"...,
16384) = 44
[pid 131555] 09:57:53.460817 select(14, [4 5 11 13], [10], NULL, NULL) = 1 (out
[10])
[pid 131555] 09:57:53.461023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 131555] 09:57:53.461221 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 131555] 09:57:53.461314 write(10, "\0\0\0\7", 4) = 4
 | 00000  00 00 00 07                                       ....             |
[pid 131555] 09:57:53.461526 select(14, [4 5 11 13], [], NULL, NULL) = 1 (in
[4])
[pid 131555] 09:57:56.463410 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 131555] 09:57:56.463858 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 131555] 09:57:56.464211 read(4, "\0\0\0
\367\270\356\21\304-u\200cQ\16\350UAo\264*\231B\303\275\250\221Nzo\33\231"...,
16384) = 44
[pid 131555] 09:57:56.464336 select(14, [4 5 11 13], [10], NULL, NULL) = 1 (out
[10])
[pid 131555] 09:57:56.464684 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 131555] 09:57:56.464915 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 131555] 09:57:56.465239 write(10, "\0\0\0\7", 4) = 4
 | 00000  00 00 00 07                                       ....             |
[pid 131555] 09:57:56.465354 select(14, [4 5 11 13], [], NULL, NULL^Z
[1]+  Angehalten              strace -f -p 131431 -tt -e write=all
root at debian:~# ps -ef|grep 131555
root      131555  131431  0 09:57 ?        00:00:00 sshd: root at notty
root      131562  131555  3 09:57 ?        00:00:02 rsync --server --sender
-vlogDtpre.iLsfxCIvu --timeout=5 . /
root      131565  131540  0 09:58 pts/1    00:00:00 grep 131555
-- 
You are receiving this mail because:
You are the QA Contact for the bug.
samba-bugs at samba.org
2022-Sep-11  08:07 UTC
[Bug 15163] rsync timeout non-effective and incorrect behaviour
https://bugzilla.samba.org/show_bug.cgi?id=15163 --- Comment #4 from roland <devzero at web.de> --- here with debug=all working correctly # /root/rsync/rsync-3.2.5/rsync -avi --timeout=5 --exclude='/proc' --exclude='/dev/' --exclude='/sys' --debug=all --msgs2stderr root at 172.20.37.189:/iscsipool /zfspool/backup/172.20.37.189/backup opening connection using: ssh -l root 172.20.37.189 rsync --server --sender -vlogDtpre.iLsfxCIvu --msgs2stderr --timeout=5 . /iscsipool (12 args) (Client) Protocol versions: remote=31, negotiated=31 Client negotiated checksum: md5 receiving incremental file list [Receiver] send_msg(0, 36) [Receiver] io timeout after 6 seconds -- exiting rsync error: timeout in data send/receive (code 30) at io.c(197) [Receiver=3.2.5] [Receiver] _exit_cleanup(code=30, file=io.c, line=197): about to call exit(30) hang forever # /root/rsync/rsync-3.2.5/rsync -avi --timeout=5 --exclude='/proc' --exclude='/dev/' --exclude='/sys' --debug=all --msgs2stderr root at 172.20.37.189:/ /zfspool/backup/172.20.37.189/backup 2>&1 |head -n50 opening connection using: ssh -l root 172.20.37.189 rsync --server --sender -vlogDtpre.iLsfxCIvu --msgs2stderr --timeout=5 . / (12 args) (Client) Protocol versions: remote=31, negotiated=31 Client negotiated checksum: md5 receiving incremental file list [Receiver] send_msg(0, 36) [Receiver] got msg=0, len=0 [Receiver] got msg=0, len=30051 get_local_name count=25 /zfspool/backup/172.20.37.189/backup [Receiver] change_dir(/zfspool/backup/172.20.37.189/backup) generator starting pid=9452 delta-transmission enabled recv_generator(.,0) recv_files(25) starting recv_generator(.,1) <snipp> recv_generator(etc/ssl/certs/f39fc864.0,1770) recv_generator(etc/ssl/certs/f51bb24c.0,1771) recv_generator(etc/ssl/certs/fc5a8f99.0,1772) recv_generator(etc/ssl/certs/fe8a2cd8.0,1773) recv_generator(etc/ssl/certs/ff34af3f.0,1774) [receiver] send_msg(0, 0) [generator] send_msg(0, 0) [generator] got msg=0, len=0 [generator] send_msg(0, 0) [generator] send_msg(0, 0) [generator] send_msg(0, 0) [generator] send_msg(0, 0) [generator] send_msg(0, 0) [generator] send_msg(0, 0) -- You are receiving this mail because: You are the QA Contact for the bug.
samba-bugs at samba.org
2022-Sep-11  08:45 UTC
[Bug 15163] rsync timeout non-effective and incorrect behaviour
https://bugzilla.samba.org/show_bug.cgi?id=15163
--- Comment #5 from roland <devzero at web.de> ---
apparently, this is causing the problem:
        if (am_receiver) {
                return;
        }
if i comment out the return statement, things work again.
@wayne, what is the reason that timeout checking is being skipped when
"am_receiver" = true ? 
I don't understand the code well enough ....
-- 
You are receiving this mail because:
You are the QA Contact for the bug.
samba-bugs at samba.org
2022-Dec-13  21:37 UTC
[Bug 15163] rsync timeout non-effective and incorrect behaviour
https://bugzilla.samba.org/show_bug.cgi?id=15163 --- Comment #6 from roland <devzero at web.de> --- nobody has a clue? i think proper rsync timeout handling is important. i have had whole nightly backup procedures hung for the whole night because rsync got stuck and didn't get timeout, i.e. machines did not get proper backup because of this. i would like to help making this more reliable. i think i have provided some useful input for it. -- You are receiving this mail because: You are the QA Contact for the bug.