(Versions: OpenSSH_3.7.1p2, rsync version 2.6.2)

I've just encountered a situation where "rsync -v -n" appears to run normally, but reports many fewer file transfers than actually get done when you remove the -n. (This is not one of the usual "-n" corner cases.)

It turns out that this only happens when you're doing a remote rsync over ssh AND you redirect stderr into a pipe that fills up, as in

    rsync -e ssh -avn host:/path /local/path 2>&1 | tee LOG

I can get the right answer by just not capturing stderr; i.e. removing the "2>&1" and just saying

    rsync -avn host:/path /local/path | tee LOG

works. The data loss occurs when the pipe (to tee here) fills, so in principle you could lose output even without the "-n"; it's just less likely when the output is generated more slowly.

After poking around with strace, it seems that rsync's child ssh sets its stdERR non-blocking, and that stderr has been inherited unchanged from the top-level rsync. (The rsync has supplied pipes for its child's stdin and stdout, but left the stderr alone; see rsync-2.6.2/pipe.c::piped_child().) Because of the "2>&1", the top-level stderr is a dup of the top-level stdout, so ssh has inadvertently made rsync's stdOUT non-blocking. Rsync is not expecting that, and does not check the return code from fflush(stdout), so it can silently drop lines from stdout. (See the end of rsync-2.6.2/log.c::rwrite().)

CVS has basically the same problem, as discussed at http://groups.google.com/groups?th=e4df2fdc1f4f4950, which mentions some workarounds that the CVS people considered.

It's not clear whether the problem should really be fixed in rsync, ssh, or glibc, but in the meantime, would it be worth adding a warning to the docs/FAQ/known-issues/wherever?
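The mechanism described above (O_NONBLOCK belongs to the open file shared by dup'ed descriptors, not to one descriptor) can be checked with a small C sketch. This is only an illustration under stated assumptions: the helper names is_nonblocking, make_nonblocking and fill_until_refused are hypothetical and do not come from rsync or ssh, and a plain pipe stands in for the pipe to tee.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd has O_NONBLOCK set, 0 if not, -1 on error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}

/* Hypothetical helper: sets O_NONBLOCK on fd, roughly what the strace
 * output suggests ssh does to the stderr it inherited. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Writes to a non-blocking fd until write() refuses more data and
 * returns the errno of the refused write (EAGAIN for a full pipe). */
int fill_until_refused(int fd)
{
    char buf[4096] = {0};

    /* On a blocking fd this loop would simply hang once the pipe fills. */
    assert(is_nonblocking(fd) == 1);
    for (;;) {
        if (write(fd, buf, sizeof buf) < 0)
            return errno;
    }
}
```

If "out" is the write end of a pipe and "err" is dup(out), as with "2>&1", then calling make_nonblocking(err) also makes is_nonblocking(out) report 1, and a writer that ignores the EAGAIN returned once the pipe is full loses that output.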
On Tue, Sep 28, 2004 at 04:10:13PM +0000, David Evers wrote:
> rsync -e ssh -avn host:/path /local/path 2>&1 | tee LOG

In my test I piped the output to "(sleep 10; tail)" to ensure a reproducible truncation.

The attached patch fixes the problem by putting our stderr fd back into blocking I/O mode. I don't see why ssh should be playing with our stderr fd in the first place (since we're the one calling ssh, not the one being run by ssh). Does anyone see a problem with this change?

..wayne..

-------------- next part --------------
--- main.c	17 Sep 2004 16:50:53 -0000	1.217
+++ main.c	28 Sep 2004 16:42:09 -0000
@@ -657,6 +657,9 @@ int client_run(int f_in, int f_out, pid_
 	if (protocol_version >= 23 && !read_batch)
 		io_start_multiplex_in();
 
+	/* Work around a bug in ssh that sets our STDERR to non-blocking. */
+	set_blocking(STDERR_FILENO);
+
 	if (am_sender) {
 		keep_dirlinks = 0; /* Must be disabled on the sender. */
 		io_start_buffering_out();
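For readers without the rsync source to hand, putting a descriptor back into blocking mode comes down to clearing O_NONBLOCK with fcntl(). The following is a minimal sketch of that idea only; make_blocking is a hypothetical name and this is not claimed to be rsync's own set_blocking().

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: clears O_NONBLOCK on fd.
 * Returns 0 on success (or if fd was already blocking), -1 on error. */
int make_blocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    if (!(flags & O_NONBLOCK))
        return 0;   /* nothing to do */
    return fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) < 0 ? -1 : 0;
}
```

Because the flag lives on the shared open file, a call such as make_blocking(STDERR_FILENO) changes the mode for every process holding a dup of that file, including the child that set it non-blocking.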
> The attached patch fixes the problem by putting our stderr fd back
> into blocking I/O mode. I don't see why ssh should be playing with
> our stderr fd in the first place (since we're the one calling ssh,
> not the one being run by ssh). Does anyone see a problem with this
> change?

With this patch, isn't there a race over the non-blocking flag between the child ssh and the parent rsync? I'm not seeing how we know that ssh has finished messing with stderr by the time the patch comes to try to set it back to blocking.

More generally, there's presumably a reason why ssh wants to set its stderr non-blocking -- its internal io scheduling does seem to have been quite delicate in the past, so perhaps doing something different might upset it. Would it be worth asking the ssh people what's going on?

The clearest picture I've found of the problem is in the message at

    http://www.mail-archive.com/bug-cvs@gnu.org/msg04280.html

That describes a way of ensuring that ssh and its parent have different _file objects_ (in the unix sense) for output destined for stderr, so that each process gets to set the blockingness it wants independent of the other. The downside is that something has to do the copying from ssh's stderr to the parent's stderr.

Cheers,

---- David
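The separate-file-object approach can be sketched in C roughly as follows. This is an illustration under assumptions, not the code from the linked message: run_with_private_stderr is a hypothetical name, and the parent here copies the child's stderr in a simple blocking loop, whereas a real client would have to fold that copying into its existing I/O loop.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs cmd[0] with its stderr attached to a private
 * pipe and copies everything it writes there to parent_err_fd.
 * Returns the child's exit status, or -1 on failure. */
int run_with_private_stderr(char *const cmd[], int parent_err_fd)
{
    int errpipe[2];
    char buf[4096];
    ssize_t n;
    int status;
    pid_t pid;

    assert(cmd != NULL && cmd[0] != NULL);
    if (pipe(errpipe) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(errpipe[0]);
        close(errpipe[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: stderr becomes the pipe, a file object of its own, so
         * setting O_NONBLOCK on it cannot affect the parent's stderr. */
        close(errpipe[0]);
        if (errpipe[1] != STDERR_FILENO) {
            if (dup2(errpipe[1], STDERR_FILENO) < 0)
                _exit(127);
            close(errpipe[1]);
        }
        execvp(cmd[0], cmd);
        _exit(127);
    }

    /* Parent: this is the copying that "something has to do". */
    close(errpipe[1]);
    while ((n = read(errpipe[0], buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(parent_err_fd, buf + off, (size_t)(n - off));
            if (w < 0)
                break;  /* give up on this chunk, keep draining the pipe */
            off += w;
        }
    }
    close(errpipe[0]);

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this arrangement the child can switch its stderr to non-blocking without touching the mode of the descriptor the parent writes to.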
> The attached patch fixes the problem by putting our stderr fd back
> into blocking I/O mode.

A bit of digging in the openssh bugzilla throws up this:

    http://bugzilla.mindrot.org/show_bug.cgi?id=26

which suggests that ssh really does want to keep stderr non-blocking :-(

Cheers,

---- David
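If the descriptor has to stay non-blocking, one option not proposed in this thread is for the writer to tolerate EAGAIN instead of dropping output. The sketch below is an assumption about how that could look; write_all is a hypothetical helper, and it works on a raw descriptor rather than the stdio stream that rsync flushes in rwrite().

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: writes all len bytes to fd, waiting with poll()
 * whenever a non-blocking fd reports that it is full.
 * Returns 0 on success, -1 on any other error. */
int write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n > 0) {
            buf += n;
            len -= (size_t)n;
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Full pipe on a non-blocking fd: wait until it is writable. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
        } else if (n < 0 && errno == EINTR) {
            continue;   /* interrupted by a signal, retry */
        } else {
            return -1;  /* real error, or a zero-length write */
        }
    }
    return 0;
}
```

A call such as write_all(STDOUT_FILENO, line, strlen(line)) then behaves the same whether or not another process has switched the shared file to non-blocking mode.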