Damien Miller wrote:
> On Tue, 28 Nov 2000, Theron Tock wrote:
>
> > I'm using rsync over ssh to do some backups from a redhat 6.2 machine
> > and I found that I was able to semi-reproducibly get openssh to hang.
> > Using strace and gdb, it seemed that the problem was due to a too-large
> > call to write.
>
> What kernel are you using?
kernel: Linux version 2.2.14-5.0smp (root at porky.devel.redhat.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 SMP Tue Mar 7 21:01:40 EST 2000
>
> Can you discern anything from an strace of a failing client?
>
Yes -- the rsync client calls write() with somewhere around 50K of data
and the call never returns. It's not 100% repeatable, though it happens
almost every time (and at different points in the rsync).

I do have a theory for why it might be happening. Rsync is communicating
with ssh via a pipe, and rsync is busy shoveling large amounts of data
through ssh to the remote rsync. The pipe fills up and the local rsync
goes to sleep waiting for it to drain. Meanwhile, ssh gets a response
from the remote rsync and tries to write it back to the local rsync, but
because that write is so large (50K) and the read side (from ssh's
perspective) of the pipe is full, the kernel puts ssh to sleep as well,
waiting for the pipe to clear. Hence the deadlock. It sounds sort of
hokey, since I would have expected the read and write buffers of a pipe
to be completely independent of each other, but if they are somehow
shared for writes >32K, that would explain the problem. The fact that it
happens at different points of the rsync does support the theory,
though.
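
For what it's worth, the deadlock I'm describing is easy to model
outside of rsync and ssh. Here's a quick standalone sketch (my own
illustration, not code from either program): a parent and child are
connected by a pipe in each direction, and each does a blocking write()
much larger than the pipe buffer before reading anything, so neither
side ever drains the other and both writes block forever.

    /* Illustration only: two processes each block in a large write()
     * to a pipe the other end isn't draining, so neither call returns. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BIG (256 * 1024)   /* comfortably larger than a pipe buffer */

    int main(void)
    {
        int a[2], b[2];        /* a: parent -> child, b: child -> parent */
        static char buf[BIG];
        pid_t pid;

        if (pipe(a) < 0 || pipe(b) < 0) {
            perror("pipe");
            exit(1);
        }
        memset(buf, 'x', sizeof(buf));

        if ((pid = fork()) < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {
            /* child: push a big chunk to the parent before reading */
            close(a[1]); close(b[0]);
            write(b[1], buf, sizeof(buf));  /* blocks once b fills up */
            /* never reaches a read of a[0], so the parent stays stuck */
            return 0;
        }

        /* parent: likewise pushes a big chunk before reading */
        close(a[0]); close(b[1]);
        write(a[1], buf, sizeof(buf));      /* blocks once a fills up */
        /* never reaches a read of b[0], so the child stays stuck */
        return 0;
    }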
One interesting thing is that if I attach to ssh with strace and then
kill strace by hitting ^C, ssh picks up and everything merrily resumes.
I'm guessing that killing strace causes the system call to return with
a partial write, and everything can then proceed -- much as if the
write had been done with a smaller buffer in the first place.
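
To be clear about what I mean by a partial write: when a signal
interrupts a blocking write() after some of the bytes have already gone
out, write() returns that short count rather than failing, and the
caller is expected to continue from that offset (EINTR only shows up if
nothing was written yet). The usual shape of the write path is a loop
like this -- just a sketch of the idea, not ssh's actual code:

    #include <errno.h>
    #include <unistd.h>

    /* Write all of buf, resuming after partial writes and EINTR. */
    ssize_t write_all(int fd, const char *buf, size_t len)
    {
        size_t done = 0;

        while (done < len) {
            ssize_t n = write(fd, buf + done, len - done);
            if (n < 0) {
                if (errno == EINTR)  /* interrupted before any bytes moved */
                    continue;
                return -1;           /* real error */
            }
            done += n;               /* partial write: pick up where it stopped */
        }
        return (ssize_t)done;
    }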
-Theron