Hi all,
I'm using rsync to replicate a cvs repository, it is working quite good
except that I occassionnaly see rsync stucked after having transfered a
file, it is always on a big ,v file containing multiple version of big
binary file.
I haven't gathered the netstat information on the frozen process (it is
not easy to reproduce the problem), but I have the strace information :
- the server process is blocked on select system call "select(8, [7],
[], NULL, {60, 0}) = 0 (Timeout)"
- the two clients process are also stuck on the same system call.
I'm using rsync 2.6.8, and I give bellow extract of the strace (the
complete strace are huge).
Is it enough information to have an idea of the problem ?
What else could I do to help debugging this issue ?
Thank you for your help.
--
Lo?c Le Loarer
Here is an extract of the strace of the client process with the highest
pid :
11:20:42 open("toto.gz,v", O_RDONLY) = 1
11:20:42 fstat(1, {st_mode=S_IFREG|0444, st_size=1785405919, ...}) = 0
11:20:42 open(".toto.gz,v.3axXXh", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
11:20:42 fchmod(3, 0400) = 0
....... file transfert, many read and write
11:48:48 read(0, "\24.........", 8184) = 387
11:48:48 write(3, "%I\17H/M\...\226"..., 254657) = 254657
11:48:48 brk(0x588000) = 0x588000
11:48:48 close(1) = 0
11:48:48 close(3) = 0
11:48:48 lstat(".toto.gz,v.3axXXh", {st_mode=S_IFREG|0400,
st_size=2039997121
, ...}) = 0
11:48:48 chmod(".toto.gz,v.3axXXh", 0444) = 0
11:48:48 rename(".toto.gz,v.3axXXh", "toto.gz,v") = 0
11:48:50 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
11:49:50 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
11:50:50 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
... an so on until I kill it
13:28:51 select(1, [0], [], NULL, {60, 0}) = ? ERESTARTNOHAND (To be
restarted)
13:29:19 --- SIGTERM (Terminated) @ 0 (0) ---
13:29:19 select(0, NULL, NULL, NULL, {0, 400000}) = 0 (Timeout)
13:29:20 select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
13:29:20 rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
13:29:20 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
13:29:20 write(4, "\\\0\0\10rsync error: received SIGINT, SIGTERM, or
SIGHUP (code
20) at rsync.c(242) [receiver=2.6.8]\n", 96) = 96
13:29:20 exit_group(20) = ?
Here is an extract of the strace of the server process :
02:48:48 read(7, "\374\17\0\7", 4) = 4
02:48:48 select(8, [7], [], NULL, {60, 0}) = 1 (in [7], left {60, 0})
02:48:48 read(7, "\366....\315"..., 4092) = 1232
02:48:48 select(8, [7], [], NULL, {60, 0}) = 0 (Timeout)
02:49:48 select(8, [7], [], NULL, {60, 0}) = 0 (Timeout)
02:50:48 select(8, [7], [], NULL, {60, 0}) = 0 (Timeout)
... an so on until I kill it
04:28:47 select(8, [7], [], NULL, {60, 0}) = 1 (in [7], left {27, 450000})
04:29:20 read(7, "\\\0\0\10rsync error: received SIGINT, SIGTERM, or
SIGHUP (code 2
0) at rsync.c(242) [receiver=2.6.8]\n", 2860) = 96
04:29:20 select(8, [7], [], NULL, {60, 0}) = 1 (in [7], left {59, 990000})
04:29:20 read(7, "N\0\0\10rsync: connection unexpectedly closed (112
bytes received
so far) [generator]\nZ\0\0\10rsync error: error in rsync protocol data
stream (cod
e 12) at io.c(463) [generator=2.6.8]\n", 2764) = 176
04:29:20 select(8, [7], [], NULL, {60, 0}) = 1 (in [7], left {60, 0})
04:29:20 --- SIGCHLD (Child exited) @ 0 (0) ---
04:29:20 waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) =
29245
04:29:20 waitpid(-1, 0xbfff7ba4, WNOHANG) = -1 ECHILD (No child processes)
04:29:20 sigreturn() = ? (mask now [])
04:29:20 read(7, "", 2588) = 0
04:29:20 write(2, "rsync: connection unexpectedly closed (356009 bytes
received so
far) [sender]", 77) = 77
04:29:20 write(2, "\n", 1) = 1
04:29:20 rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
04:29:20 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
04:29:20 waitpid(29245, 0xbfff7e78, WNOHANG) = -1 ECHILD (No child
processes)
04:29:20 getpid() = 29244
04:29:20 kill(29245, SIGUSR1) = -1 ESRCH (No such process)
04:29:20 write(2, "rsync error: error in rsync protocol data stream
(code 12) at io
.c(463) [sender=2.6.8]", 86) = 86
04:29:20 write(2, "\n", 1) = 1
04:29:20 munmap(0xb75f3000, 4096) = 0
04:29:20 exit_group(12) = ?