rsync 2.5.0 still has a bug where it hangs under some circumstances.
The hang is beyond my abilities to track down. I'll keep trying, but
here are details in case they're of use to anyone else:

- Code configured & built on Solaris 2.5.1.
- Same binary run on Solaris 2.5.1 (client) and 2.8 (server).
- Using rsh transport, but it also fails with ssh.
- Does not fail with a local-to-local rsync.

- Source directory (on server) is NFS-mounted from a NetApp filer.
- Destination directory (on client) is local (also tested NFS; still hangs).

- Consistently hangs with -vv, never (so far) with -vvv.

Included below are three stack traces, one from the server and two
from the client. This is a pretty consistent feature: the client and
server appear to be deadlocked, each waiting for the other.

Also attached below are a script for populating a sample hierarchy,
and the rsync invocation.

Backtrace on server:

#0  0xff218224 in _poll ()
#1  0xff1cb808 in _select ()
#2  0x24bec in writefd_unbuffered (fd=1, buf=0xffbe5ed0 ">", len=66)
    at io.c:406
#3  0x24eac in mplex_write (fd=1, code=62, buf=0x591d8 "\a\020", len=62)
    at io.c:498
#4  0x24f24 in io_flush () at io.c:518
#5  0x24940 in readfd (fd=0, buffer=0xffbe7020 "?\002r\215?/\201R", N=4)
    at io.c:314
#6  0x24998 in read_int (f=0) at io.c:329
#7  0x199a4 in send_files (flist=0x574e8, f_out=1, f_in=0) at sender.c:110
#8  0x1d1e8 in do_server_sender (f_in=0, f_out=1, argc=1, argv=0x56f74)
    at main.c:300
#9  0x1d708 in start_server (f_in=0, f_out=1, argc=2, argv=0x56f70)
    at main.c:476
#10 0x1e08c in main (argc=2, argv=0x56f70) at main.c:838

Backtrace #1 on client (the parent):

#0  0xef5b7904 in _poll ()
#1  0xef5d3d40 in _select ()
#2  0x24644 in read_timeout (fd=6, buf=0xeffff348 "????", len=4) at io.c:191
#3  0x247dc in read_unbuffered (fd=6, buf=0xeffff348 "????", len=4) at io.c:263
#4  0x24950 in readfd (fd=6, buffer=0xeffff348 "????", N=4) at io.c:316
#5  0x24998 in read_int (f=6) at io.c:329
#6  0x184e8 in generate_files (f=5, flist=0x57520, local_name=0x0, f_recv=6)
    at generator.c:471
#7  0x1d3fc in do_recv (f_in=4, f_out=5, flist=0x57520, local_name=0x0)
    at main.c:379
#8  0x1d958 in client_run (f_in=4, f_out=5, pid=22226, argc=1, argv=0x56f74)
    at main.c:558
#9  0x1ddc0 in start_client (argc=1, argv=0x56f74) at main.c:731
#10 0x1e098 in main (argc=2, argv=0x56f70) at main.c:841

Backtrace #2 on client (the child):

#0  0xef5b7904 in _poll ()
#1  0xef5d3d40 in _select ()
#2  0x24644 in read_timeout (fd=4, buf=0xefffe680 "", len=4) at io.c:191
#3  0x24788 in read_loop (fd=4, buf=0xefffe680 "", len=4) at io.c:242
#4  0x24824 in read_unbuffered (fd=4, buf=0xefffe680 "", len=4) at io.c:268
#5  0x24950 in readfd (fd=4, buffer=0xefffe680 "", N=4) at io.c:316
#6  0x24998 in read_int (f=4) at io.c:329
#7  0x18eec in recv_files (f_in=4, flist=0x57520, local_name=0x0, f_gen=8)
    at receiver.c:328
#8  0x1d374 in do_recv (f_in=4, f_out=5, flist=0x57520, local_name=0x0)
    at main.c:357
#9  0x1d958 in client_run (f_in=4, f_out=5, pid=22226, argc=1, argv=0x56f74)
    at main.c:558
#10 0x1ddc0 in start_client (argc=1, argv=0x56f74) at main.c:731
#11 0x1e098 in main (argc=2, argv=0x56f70) at main.c:841

--------

The rsync above was compiled from the CVS repository on Thursday
13 Dec, early AM. I've just now (Mon 17 Dec 08:19 Mountain Time)
updated, rebuilt, and rerun the tests. Same hang.

The script below can be used to populate a directory hierarchy. It
creates subdirectories 00 through 99 under "src-test/CVSROOT" (it's
up to you to mkdir that), then a number of files in each subdir:

------------------------------------------------------------------------
#!/sw/tools/bin/Perl -w

use strict;

# up to caller to do: mkdir -p src-test/CVSROOT
my $sub = 'src-test/CVSROOT';
-d $sub
    or die "You're cd'ed to the wrong directory\n";

foreach my $i (0..99) {
    my $d = sprintf("%02d", $i);
    mkdir "$sub/$d", 02775;

    foreach my $j (1..99) {
        my $f = "$sub/$d/$j$d";
        open OUT, '>', $f
            or die "cannot create $f: $!\n";
        print OUT $f, "\n";
        close OUT or die "error writing $f: $!\n";
    }
}
------------------------------------------------------------------------

This is the rsync invocation:

------------------------------------------------------------------------
#!/bin/sh

CMD=/home/santiago/src/rsync/rsync/rsync.solaris

$CMD -z -avv --stats --delete \
    --rsync-path=$CMD.nopur \
    --timeout=600 \
    "cvsroot.eng.ascend.com:/home/santiago/tmp/rsync-test/src-test/CVSROOT" ./results
------------------------------------------------------------------------

Thanks in advance for any help,
^E
--
Ed Santiago          Toolsmith          santiago@ascend.com
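The traces above show the classic pipe/socket deadlock shape: the server is stuck in writefd_unbuffered() (its outbound buffer apparently full) while it still owes the client a read, and both client processes are stuck in read_timeout() waiting for data that never arrives. Here is a minimal toy sketch (plain Python, not rsync code) of why a blocking write wedges once the kernel buffer fills; it uses a non-blocking write so the demonstration itself cannot hang:

```python
import errno
import fcntl
import os

# Fill a pipe until the kernel buffer is full.  With O_NONBLOCK the
# write reports the condition (EAGAIN) instead of sleeping the way a
# blocking write -- like the one in the server backtrace -- would.
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
while True:
    try:
        written += os.write(w, b"x" * 4096)
    except OSError as e:
        if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            break  # buffer full: a *blocking* write would now sleep
        raise

print("pipe filled after", written, "bytes")
os.close(r)
os.close(w)
```

If each peer reaches this state while neither is servicing its reads, nobody ever drains the other's buffer, and both sides sit in poll()/select() forever, which is exactly the picture in the backtraces.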
I'm running 2.5.1pre3 and seeing lots of hangs as well. Under
2.4.6+Waynes_nohang, I didn't have trouble this bad.

SRC:       solaris 2.7, NetApp NFS tree
DST:       solaris 2.8, linux 2.[2,4].*
TRANSPORT: ssh

This setup had worked well for months before the upgrade to 2.5.1pre3.
I have not tried -vvv; I'll try that and see what it does. Sure seems
like a timing problem.

eric

Ed Santiago wrote:
>
> rsync 2.5.0 still has a bug where it hangs under some circumstances.
>
> The hang is beyond my abilities to track down. I'll keep trying,
> though, but here are details in case they're of use to anyone else:
>
> - Code configured & built on Solaris 2.5.1.
> - Same binary run on Solaris 2.5.1 (client) and 2.8 (server).
> - Using rsh transport, but also fails with ssh
> - Does not fail with local-local rsync
>
> - Source directory (on server) is NFS-mounted, from NetApp filer
> - Destination directory (on client) is local (tested NFS, also hangs)
>
> - Consistently hangs with -vv, never (so far) with -vvv
>
> Included below are three stack traces, one on the server and two
> on the client. This is a pretty consistent feature: The client
> and server appear to be deadlocked waiting for each other.
>
> Also attached below are a script for populating a sample hierarchy,
> and the rsync invocation.
On 17 Dec 2001, Ed Santiago <santiago@ascend.com> wrote:

> rsync 2.5.0 still has a bug where it hangs under some circumstances.
>
> The hang is beyond my abilities to track down.

The other thing you absolutely need to send is the output of

    netstat -ta

while the program is running. If netstat on your platform supports
other options to get more info, like -p -o -e, then you can specify
them as well.

Thanks.

--
Martin
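For whoever is collecting these reports, here is a small convenience sketch for grabbing the requested output while a transfer is hung. The helper name is made up (nothing rsync ships), and the flag list should be trimmed to whatever your platform's netstat actually supports:

```python
import shutil
import subprocess

def snapshot_sockets(flags=("-ta",)):
    """Return netstat output with the given flags, or a note if
    netstat is not installed on this host."""
    if shutil.which("netstat") is None:
        return "netstat unavailable on this host"
    result = subprocess.run(["netstat", *flags],
                            capture_output=True, text=True)
    # Keep stderr too: Solaris and Linux netstat disagree on flags,
    # and the error message itself is useful in a bug report.
    return result.stdout + result.stderr

snap = snapshot_sockets()
print(snap[:300])
```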
I have two rsync 2.5.1pre3 sessions hung right now. In this case it
appears that the processes at the destination have exited and the
source is still waiting for something. Here are some more details:

SRC: solaris 2.7 (2G RAM, 2 CPUs)
DST: linux (one is 2.4.8, the other is 2.2.18)
Transport: ssh

ps on the source shows two ssh channels open (parallel rsyncs, thanks
to a simple perl script):
---------------------------------------------------------------
# ps -aef | grep ssh
    root  9893  9885  0 19:08:05 ?  0:02 /usr/local/bin/ssh penmtc /usr/bin/rsync --server -vvlWogDtpr --timeout=6000
    root  9879  9873  0 19:08:05 ?  0:01 /usr/local/bin/ssh penkjose /usr/bin/rsync --server -vvlWogDtpr --timeout=6000

ps on the source also shows the normal rsync processes (the two
important ones are 9885 and 9873):
-----------------------------------------------
# ps -aef | grep rsync
    root  9885  9877  0 19:08:05 ?  0:25 /usr1/tis/sunos6/bin/rsync -W -a --rsync-path=/usr/bin/rsync --delete --partial
    root  9870  9839  0 19:08:05 ?  0:00 sh -c /usr1/tis/sunos6/bin/rsync -W -a --rsync-path=/usr/bin/rsync --delete -
    root  9893  9885  0 19:08:05 ?  0:02 /usr/local/bin/ssh penmtc /usr/bin/rsync --server -vvlWogDtpr --timeout=6000
    root  9879  9873  0 19:08:05 ?  0:01 /usr/local/bin/ssh penkjose /usr/bin/rsync --server -vvlWogDtpr --timeout=6000
    root  9877  9853  0 19:08:05 ?  0:00 sh -c /usr1/tis/sunos6/bin/rsync -W -a
    root  9873  9870  0 19:08:05 ?  0:24 /usr1/tis/sunos6/bin/rsync -W -a --rsync-path=/usr/bin/rsync --delete --partial

# truss -p 9873
poll(0xFFBE6F38, 1, 6000000)    (sleeping...)
# truss -p 9885
poll(0xFFBE6F30, 1, 6000000)    (sleeping...)

Next, checking the destinations for rsync processes:
--------------------------------------------------
# ssh penkjose
Last login: Mon Dec 17 13:48:33 2001 from herc
Have a lot of fun...
penkjose:~ # ps -aef | grep rsync
    root  6367  6357  0 19:47 pts/0  00:00:00 grep rsync
penkjose:~ # exit

# ssh penmtc
Last login: Mon Dec 17 19:48:01 2001 from herc
penmtc:~ # ps -aef | grep rsync
    root  1187  1177  0 19:48 pts/0  00:00:00 grep rsync
penmtc:~ # exit

In my experience, once they time out the transfer will continue, but
it is odd that the three rsync processes are not coordinating things
better.

eric

Martin Pool wrote:
>
> On 17 Dec 2001, Ed Santiago <santiago@ascend.com> wrote:
> > rsync 2.5.0 still has a bug where it hangs under some circumstances.
> >
> > The hang is beyond my abilities to track down.
>
> The other thing you absolutely need to send is the output of
>
>     netstat -ta
>
> while the program is running. If netstat on your platform supports
> other options to get more info like -p -o -e then you can specify them
> as well.
>
> Thanks.
>
> --
> Martin
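The truss output above is consistent with the --timeout=6000 option: the timeout is given in seconds and becomes the millisecond wait passed down to poll()/select(), which is why it shows up as poll(..., 6000000). A rough sketch of the same idiom (names are illustrative, not rsync's actual code):

```python
import os
import select

def read_with_timeout(fd, nbytes, timeout_s):
    """Wait at most timeout_s seconds for fd to become readable,
    then read -- the select()-with-timeout idiom the truss output
    reflects (select takes seconds; poll takes milliseconds)."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        raise TimeoutError("io timeout after %s seconds" % timeout_s)
    return os.read(fd, nbytes)

# Demonstrate on a pipe with no writer: the timeout fires.
r, w = os.pipe()
try:
    read_with_timeout(r, 4, 0.1)
    timed_out = False
except TimeoutError:
    timed_out = True
finally:
    os.close(r)
    os.close(w)
print("timed out:", timed_out)
```

This matches the reported behavior that a hung session eventually continues once the timeout expires: the sleeping poll() returns, and the stuck side gives up on the dead peer.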