rsync 2.5.0 still has a bug where it hangs under some circumstances.
The hang is beyond my abilities to track down. I'll keep trying, but
here are details in case they're of use to anyone else:

- Code configured & built on Solaris 2.5.1.
- Same binary run on Solaris 2.5.1 (client) and 2.8 (server).
- Using rsh transport, but it also fails with ssh.
- Does not fail with a local-to-local rsync.

- Source directory (on server) is NFS-mounted from a NetApp filer.
- Destination directory (on client) is local (also tested NFS; still hangs).

- Consistently hangs with -vv, never (so far) with -vvv.

Included below are three stack traces, one from the server and two
from the client. This is a pretty consistent feature: the client and
server appear to be deadlocked, each waiting for the other.

Also attached below are a script for populating a sample hierarchy,
and the rsync invocation.

Backtrace on server:

#0  0xff218224 in _poll ()
#1  0xff1cb808 in _select ()
#2  0x24bec in writefd_unbuffered (fd=1, buf=0xffbe5ed0 ">", len=66)
    at io.c:406
#3  0x24eac in mplex_write (fd=1, code=62, buf=0x591d8 "\a\020", len=62)
    at io.c:498
#4  0x24f24 in io_flush () at io.c:518
#5  0x24940 in readfd (fd=0, buffer=0xffbe7020 "?\002r\215?/\201R", N=4)
    at io.c:314
#6  0x24998 in read_int (f=0) at io.c:329
#7  0x199a4 in send_files (flist=0x574e8, f_out=1, f_in=0) at sender.c:110
#8  0x1d1e8 in do_server_sender (f_in=0, f_out=1, argc=1, argv=0x56f74)
    at main.c:300
#9  0x1d708 in start_server (f_in=0, f_out=1, argc=2, argv=0x56f70)
    at main.c:476
#10 0x1e08c in main (argc=2, argv=0x56f70) at main.c:838

Backtrace #1 on client (the parent):

#0  0xef5b7904 in _poll ()
#1  0xef5d3d40 in _select ()
#2  0x24644 in read_timeout (fd=6, buf=0xeffff348 "????", len=4) at io.c:191
#3  0x247dc in read_unbuffered (fd=6, buf=0xeffff348 "????", len=4) at io.c:263
#4  0x24950 in readfd (fd=6, buffer=0xeffff348 "????", N=4) at io.c:316
#5  0x24998 in read_int (f=6) at io.c:329
#6  0x184e8 in generate_files (f=5, flist=0x57520, local_name=0x0, f_recv=6)
    at generator.c:471
#7  0x1d3fc in do_recv (f_in=4, f_out=5, flist=0x57520, local_name=0x0)
    at main.c:379
#8  0x1d958 in client_run (f_in=4, f_out=5, pid=22226, argc=1, argv=0x56f74)
    at main.c:558
#9  0x1ddc0 in start_client (argc=1, argv=0x56f74) at main.c:731
#10 0x1e098 in main (argc=2, argv=0x56f70) at main.c:841

Backtrace #2 on client (the child):

#0  0xef5b7904 in _poll ()
#1  0xef5d3d40 in _select ()
#2  0x24644 in read_timeout (fd=4, buf=0xefffe680 "", len=4) at io.c:191
#3  0x24788 in read_loop (fd=4, buf=0xefffe680 "", len=4) at io.c:242
#4  0x24824 in read_unbuffered (fd=4, buf=0xefffe680 "", len=4) at io.c:268
#5  0x24950 in readfd (fd=4, buffer=0xefffe680 "", N=4) at io.c:316
#6  0x24998 in read_int (f=4) at io.c:329
#7  0x18eec in recv_files (f_in=4, flist=0x57520, local_name=0x0, f_gen=8)
    at receiver.c:328
#8  0x1d374 in do_recv (f_in=4, f_out=5, flist=0x57520, local_name=0x0)
    at main.c:357
#9  0x1d958 in client_run (f_in=4, f_out=5, pid=22226, argc=1, argv=0x56f74)
    at main.c:558
#10 0x1ddc0 in start_client (argc=1, argv=0x56f74) at main.c:731
#11 0x1e098 in main (argc=2, argv=0x56f70) at main.c:841

--------

The rsync above was compiled from the CVS repository on Thursday
13 Dec, early AM. I've just now (Mon 17 Dec 08:19 Mountain Time)
updated, rebuilt, and rerun the tests. Same hang.

The script below can be used to populate a directory hierarchy. It
creates subdirectories 00 through 99 under "src-test/CVSROOT" (it's
up to you to mkdir that), then a number of files in each subdir:

------------------------------------------------------------------------
#!/sw/tools/bin/Perl -w

use strict;

# up to caller to do: mkdir -p src-test/CVSROOT
my $sub = 'src-test/CVSROOT';
-d $sub
    or die "You're cd'ed to the wrong directory\n";

foreach my $i (0..99) {
    my $d = sprintf("%02d", $i);
    mkdir "$sub/$d", 02775;

    foreach my $j (1..99) {
        my $f = "$sub/$d/$j$d";
        open OUT, '>', $f
            or die "cannot create $f: $!\n";
        print OUT $f, "\n";
        close OUT or die "error writing $f: $!\n";
    }
}
------------------------------------------------------------------------

This is the rsync invocation:

------------------------------------------------------------------------
#!/bin/sh

CMD=/home/santiago/src/rsync/rsync/rsync.solaris

$CMD -z -avv --stats --delete \
    --rsync-path=$CMD.nopur \
    --timeout=600 \
    "cvsroot.eng.ascend.com:/home/santiago/tmp/rsync-test/src-test/CVSROOT" ./results
------------------------------------------------------------------------

Thanks in advance for any help,
^E
--
Ed Santiago          Toolsmith          santiago@ascend.com
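The traces above show the classic pipe/socket deadlock shape: the server is stuck in writefd_unbuffered() (its outbound buffer apparently full) while it still owes the client a read, and both client processes are stuck in read_timeout() waiting for data that never arrives. Here is a minimal toy sketch (plain Python, not rsync code) of why a blocking write wedges once the kernel buffer fills; it uses a non-blocking write so the demonstration itself cannot hang:

```python
import errno
import fcntl
import os

# Fill a pipe until the kernel buffer is full.  With O_NONBLOCK the
# write reports the condition (EAGAIN) instead of sleeping the way a
# blocking write -- like the one in the server backtrace -- would.
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
while True:
    try:
        written += os.write(w, b"x" * 4096)
    except OSError as e:
        if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            break  # buffer full: a *blocking* write would now sleep
        raise

print("pipe filled after", written, "bytes")
os.close(r)
os.close(w)
```

If each peer reaches this state while neither is servicing its reads, nobody ever drains the other's buffer, and both sides sit in poll()/select() forever, which is exactly the picture in the backtraces.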
I'm running 2.5.1pre3 and seeing lots of hangs as well. Under
2.4.6+Waynes_nohang, I didn't have trouble this bad.

SRC:       solaris 2.7, NetApp NFS tree
DST:       solaris 2.8, linux 2.[2,4].*
TRANSPORT: ssh

This setup had worked well for months before the upgrade to 2.5.1pre3.
I have not tried -vvv; I'll try that and see what it does. Sure seems
like a timing problem.

eric

Ed Santiago wrote:
>
> rsync 2.5.0 still has a bug where it hangs under some circumstances.
>
> The hang is beyond my abilities to track down. I'll keep trying,
> though, but here are details in case they're of use to anyone else:
>
> - Code configured & built on Solaris 2.5.1.
> - Same binary run on Solaris 2.5.1 (client) and 2.8 (server).
> - Using rsh transport, but also fails with ssh
> - Does not fail with local-local rsync
>
> - Source directory (on server) is NFS-mounted, from NetApp filer
> - Destination directory (on client) is local (tested NFS, also hangs)
>
> - Consistently hangs with -vv, never (so far) with -vvv
>
> Included below are three stack traces, one on the server and two
> on the client. This is a pretty consistent feature: The client
> and server appear to be deadlocked waiting for each other.
>
> Also attached below are a script for populating a sample hierarchy,
> and the rsync invocation.
On 17 Dec 2001, Ed Santiago <santiago@ascend.com> wrote:

> rsync 2.5.0 still has a bug where it hangs under some circumstances.
>
> The hang is beyond my abilities to track down.

The other thing you absolutely need to send is the output of

    netstat -ta

while the program is running. If netstat on your platform supports
other options to get more info, like -p -o -e, then you can specify
them as well.

Thanks.

--
Martin
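For whoever is collecting these reports, here is a small convenience sketch for grabbing the requested output while a transfer is hung. The helper name is made up (nothing rsync ships), and the flag list should be trimmed to whatever your platform's netstat actually supports:

```python
import shutil
import subprocess

def snapshot_sockets(flags=("-ta",)):
    """Return netstat output with the given flags, or a note if
    netstat is not installed on this host."""
    if shutil.which("netstat") is None:
        return "netstat unavailable on this host"
    result = subprocess.run(["netstat", *flags],
                            capture_output=True, text=True)
    # Keep stderr too: Solaris and Linux netstat disagree on flags,
    # and the error message itself is useful in a bug report.
    return result.stdout + result.stderr

snap = snapshot_sockets()
print(snap[:300])
```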
I have two rsync 2.5.1pre3 sessions hung right now. In this case it
appears that the processes at the destination have exited and the
source is still waiting for something. Here are some more details:

SRC: solaris 2.7 (2G RAM, 2 CPUs)
DST: linux (one is 2.4.8, the other is 2.2.18)
Transport: ssh

ps on the source shows two ssh channels open (parallel rsyncs, thanks
to a simple perl script):
---------------------------------------------------------------
# ps -aef | grep ssh
    root  9893  9885  0 19:08:05 ?  0:02 /usr/local/bin/ssh penmtc /usr/bin/rsync --server -vvlWogDtpr --timeout=6000
    root  9879  9873  0 19:08:05 ?  0:01 /usr/local/bin/ssh penkjose /usr/bin/rsync --server -vvlWogDtpr --timeout=6000

ps on the source also shows the normal rsync processes (the two
important ones are 9885 and 9873):
-----------------------------------------------
# ps -aef | grep rsync
    root  9885  9877  0 19:08:05 ?  0:25 /usr1/tis/sunos6/bin/rsync -W -a --rsync-path=/usr/bin/rsync --delete --partial
    root  9870  9839  0 19:08:05 ?  0:00 sh -c /usr1/tis/sunos6/bin/rsync -W -a --rsync-path=/usr/bin/rsync --delete -
    root  9893  9885  0 19:08:05 ?  0:02 /usr/local/bin/ssh penmtc /usr/bin/rsync --server -vvlWogDtpr --timeout=6000
    root  9879  9873  0 19:08:05 ?  0:01 /usr/local/bin/ssh penkjose /usr/bin/rsync --server -vvlWogDtpr --timeout=6000
    root  9877  9853  0 19:08:05 ?  0:00 sh -c /usr1/tis/sunos6/bin/rsync -W -a
    root  9873  9870  0 19:08:05 ?  0:24 /usr1/tis/sunos6/bin/rsync -W -a --rsync-path=/usr/bin/rsync --delete --partial

# truss -p 9873
poll(0xFFBE6F38, 1, 6000000)    (sleeping...)
# truss -p 9885
poll(0xFFBE6F30, 1, 6000000)    (sleeping...)

Next, checking the destinations for rsync processes:
--------------------------------------------------
# ssh penkjose
Last login: Mon Dec 17 13:48:33 2001 from herc
Have a lot of fun...
penkjose:~ # ps -aef | grep rsync
    root  6367  6357  0 19:47 pts/0  00:00:00 grep rsync
penkjose:~ # exit

# ssh penmtc
Last login: Mon Dec 17 19:48:01 2001 from herc
penmtc:~ # ps -aef | grep rsync
    root  1187  1177  0 19:48 pts/0  00:00:00 grep rsync
penmtc:~ # exit

In my experience, once they time out the transfer will continue, but
it is odd that the three rsync processes are not coordinating things
better.

eric

Martin Pool wrote:
>
> On 17 Dec 2001, Ed Santiago <santiago@ascend.com> wrote:
> > rsync 2.5.0 still has a bug where it hangs under some circumstances.
> >
> > The hang is beyond my abilities to track down.
>
> The other thing you absolutely need to send is the output of
>
>     netstat -ta
>
> while the program is running. If netstat on your platform supports
> other options to get more info like -p -o -e then you can specify them
> as well.
>
> Thanks.
>
> --
> Martin
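The truss output above is consistent with the --timeout=6000 option: the timeout is given in seconds and becomes the millisecond wait passed down to poll()/select(), which is why it shows up as poll(..., 6000000). A rough sketch of the same idiom (names are illustrative, not rsync's actual code):

```python
import os
import select

def read_with_timeout(fd, nbytes, timeout_s):
    """Wait at most timeout_s seconds for fd to become readable,
    then read -- the select()-with-timeout idiom the truss output
    reflects (select takes seconds; poll takes milliseconds)."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        raise TimeoutError("io timeout after %s seconds" % timeout_s)
    return os.read(fd, nbytes)

# Demonstrate on a pipe with no writer: the timeout fires.
r, w = os.pipe()
try:
    read_with_timeout(r, 4, 0.1)
    timed_out = False
except TimeoutError:
    timed_out = True
finally:
    os.close(r)
    os.close(w)
print("timed out:", timed_out)
```

This matches the reported behavior that a hung session eventually continues once the timeout expires: the sleeping poll() returns, and the stuck side gives up on the dead peer.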