thr3ads.net - rsync - working on a 2.5.6pre1 release [Jan 2003]

Dave Dykstra

2003-Jan-10 00:34 UTC

working on a 2.5.6pre1 release

I'm working on trying to get rsync 2.5.6pre1 available for people to
test more widely.  I'm out of time for today, and I'm stuck on a problem
that some machines on build.samba.org are showing on the 'chgrp' test.
I can reproduce this on my home redhat 7.3 system too.  It appears to be a
timing problem because when I do strace -F -f on it the problem goes away.
Everything seems to go through normally but then it exits with an exit
code of 12, I think because the child receiver process is terminated with
a SIGUSR2 which is signal 12 and because the bug that was preventing exit
codes from being properly reported from children has now been fixed.
It's very hard to debug because it is a timing problem and because it
happens after rprintf handling is already shut down in the child process.
I suspected that maybe the catching of the SIGUSR2 signal was not getting
inherited from its parent, but it doesn't help to re-set it in the child.
Nevertheless, I'm not sure whether or not the sigusr2_handler function
is getting called in the child.  I'd appreciate some help with this if
anybody else thinks they can figure it out.

- Dave Dykstra

jw schultz

2003-Jan-10 01:10 UTC

working on a 2.5.6pre1 release

On Thu, Jan 09, 2003 at 05:09:07PM -0600, Dave Dykstra wrote:
> I'm working on trying to get rsync 2.5.6pre1 available for people to
> test more widely.  I'm out of time for today, and I'm stuck on a problem
> that some machines on build.samba.org are showing on the 'chgrp' test.
> I can reproduce this on my home redhat 7.3 system too.  It appears to be a
> timing problem because when I do strace -F -f on it the problem goes away.
> Everything seems to go through normally but then it exits with an exit
> code of 12, I think because the child receiver process is terminated with
> a SIGUSR2 which is signal 12 and because the bug that was preventing exit
> codes from being properly reported from children has now been fixed.
> It's very hard to debug because it is a timing problem and because it
> happens after rprintf handling is already shut down in the child process.
> I suspected that maybe the catching of the SIGUSR2 signal was not getting
> inherited from its parent, but it doesn't help to re-set it in the child.
> Nevertheless, I'm not sure whether or not the sigusr2_handler function
> is getting called in the child.  I'd appreciate some help with this if
> anybody else thinks they can figure it out.
I haven't pinned it down but the problem appears to have
been introduced in 1.157 of main.c with the lost exit
status patch.  I was able to reproduce your error on the
chgrp test and backing out to 1.156 fixed it.

That at least narrows it down.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

Wayne Davison

2003-Jan-10 02:57 UTC

working on a 2.5.6pre1 release

On Thu, Jan 09, 2003 at 05:09:07PM -0600, Dave Dykstra wrote:
> It's very hard to debug because it is a timing problem and because it
> happens after rprintf handling is already shut down in the child process.
Fortunately fprintf(stderr, ...) always works, even in the child process.
This is what I've been using to get some status on the problem.
> Everything seems to go through normally but then it exits with an exit
> code of 12, I think because the child receiver process is terminated with
> a SIGUSR2 which is signal 12 and because the bug that was preventing exit
> codes from being properly reported from children has now been fixed.
The value of SIGUSR2 is a red herring.  The error is really RERR_STREAMIO,
which is being returned by the whine_about_eof() routine.  I haven't had
time to figure out why this code is getting sent during the final phase of
the life of the receiver yet, though.  The receiver successfully kills the
generator, gets its 0 status code, begins to return a 0 status code, and
then it suddenly starts exit_cleanup() over again with the error 12 from
the io.c code.

One thing I have discovered is that if I remove the two rprintf() calls
from exit_cleanup() (changing them into fprintf(stderr) calls), I can't
get the test to fail.

My current theory is that the sender is closing down the socket, and if the
receiver just happens to get past the two rprintf()s before this happens,
then all is well.  If not, it gets an error (since something must be trying
to flush during the exit_cleanup(0) processing) and switches to an exit
of 12 (RERR_STREAMIO).

I'll finish debugging this later if no one else gets to it first.

..wayne..

Wayne Davison

2003-Jan-10 08:47 UTC

working on a 2.5.6pre1 release

On Thu, Jan 09, 2003 at 05:09:07PM -0600, Dave Dykstra wrote:
> I'm stuck on a problem that some machines on build.samba.org are
> showing on the 'chgrp' test.
I've checked in a fix for this bug.  Here's what I discovered:

The reason only the chgrp test failed is that it is the only test that
uses -vvv instead of just -vv.  This is significant because -vvv is
needed for the receiver to try to do IO after killing off the generator.
This IO would sometimes fail if the generator died before it finished.

More specific details:

(Background) The receiver forks off the generator with an error pipe
open from the generator to the receiver.  The receiver notes this
error-receiving fd so that it gets monitored on all IO (and thus remains
unblocked).

(The bug) When the receiver signals the generator to die, it was not
clearing io_error_fd, and thus if it did any IO that just happened to
occur after the generator closed its end of the pipe, the IO would
notice the EOF on io_error_fd and cause the receiver to exit with
error 12.  Now that rsync properly notices the return value of the
receiver process, this causes the sender to also exit with an error.

(The fix) Just call io_set_error_fd(-1) prior to signaling the generator
to die.
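
The mechanism can be reproduced outside rsync with a plain pipe. The sketch below is a simplified stand-in, not the rsync IO code. The names demo_error_fd, demo_set_error_fd, demo_poll_error_fd, and io_status_after_child_exit are hypothetical, and DEMO_ERR_STREAMIO simply mirrors the value 12 quoted in this thread. For brevity the child exits on its own rather than being signaled. A parent that still watches the read end after the child is gone sees EOF and reports an error. A parent that stops watching first does not.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_ERR_STREAMIO 12 /* mirrors the RERR_STREAMIO value quoted in the thread */

static int demo_error_fd = -1; /* stand-in for the monitored error fd */

static void demo_set_error_fd(int fd)
{
	demo_error_fd = fd;
}

/*
 * Stand-in for the check made during IO: returns 0 if nothing is wrong,
 * DEMO_ERR_STREAMIO if the watched fd has reached EOF.
 */
static int demo_poll_error_fd(void)
{
	struct pollfd p;
	char buf[64];

	if (demo_error_fd < 0)
		return 0; /* nothing is being watched */
	p.fd = demo_error_fd;
	p.events = POLLIN;
	p.revents = 0;
	if (poll(&p, 1, 0) <= 0)
		return 0; /* nothing pending on the fd */
	return read(demo_error_fd, buf, sizeof buf) == 0 ? DEMO_ERR_STREAMIO : 0;
}

/*
 * Fork a child holding the write end of a pipe, let it exit, then do "IO".
 * With clear_first set, the parent stops watching the pipe before the
 * child goes away, as in the fix described above.
 * Returns the IO status (0 or DEMO_ERR_STREAMIO), or -1 on failure.
 */
int io_status_after_child_exit(int clear_first)
{
	int fds[2], status, rc;
	pid_t pid;

	assert(clear_first == 0 || clear_first == 1);
	if (pipe(fds) < 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		close(fds[0]);
		_exit(0); /* exiting closes the child's write end */
	}
	close(fds[1]); /* parent keeps only the read end */
	demo_set_error_fd(fds[0]);

	if (clear_first)
		demo_set_error_fd(-1);
	if (waitpid(pid, &status, 0) < 0)
		rc = -1;
	else
		rc = demo_poll_error_fd(); /* any IO done after the child is gone */

	demo_set_error_fd(-1);
	close(fds[0]);
	return rc;
}
```

Here io_status_after_child_exit(0) yields the stream error because the stale fd is still monitored, while io_status_after_child_exit(1) yields 0. In rsync the failure was intermittent because it depended on whether IO happened after the other process had closed its end.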

I've appended the patch I just checked in.

..wayne..

---8<------8<------8<------8<---cut here--->8------>8------>8------>8---
Index: main.c
--- main.c	9 Jan 2003 19:04:06 -0000	1.157
+++ main.c	10 Jan 2003 08:28:39 -0000
@@ -461,6 +461,7 @@
 	}
 	io_flush();
 
+	io_set_error_fd(-1);
 	kill(pid, SIGUSR2);
 	wait_process(pid, &status);
 	return status;
---8<------8<------8<------8<---cut here--->8------>8------>8------>8---

Wayne Davison

2003-Jan-10 09:19 UTC

working on a 2.5.6pre1 release

On Fri, Jan 10, 2003 at 12:46:13AM -0800, Wayne Davison wrote:
> This is significant because -vvv is needed for the receiver to try to
> do IO after killing off the generator.  This IO would sometimes fail
> if the generator died before [the IO] finished.
For the pedantic: I mixed up the generator and the receiver throughout my
description of the problem.  It is the generator that forks off (and later
kills) the receiver, not the other way around.  Fortunately this slip of
terminology in my email does not adversely affect the patch at all.

..wayne..

Green, Paul

2003-Jan-10 18:09 UTC

working on a 2.5.6pre1 release

Dave Dykstra [mailto:dwd@drdykstra.us] wrote:
> 3. The Stratus VOS port is failing all 3 daemon tests in code
>    that is used just for testing, saying it can't create the test
>    socket.  I don't know if there's a corresponding problem in the
>    corresponding non-test code.
The socketpair_tcp calls fail because we do not implement non-blocking
connects. This is a known bug in VOS (stcp-1178) for which no fix is
currently available.  And there is no way to work around this in the
socket.c code.

I have updated the comment that appears in the output log to explain this.

Thanks
PG
--
Paul Green, Senior Technical Consultant, Stratus Technologies.
Day: +1 978-461-7557; FAX: +1 978-461-3610
Speaking from Stratus not for Stratus
