I'm working on trying to get rsync 2.5.6pre1 available for people to test more widely. I'm out of time for today, and I'm stuck on a problem that some machines on build.samba.org are showing on the 'chgrp' test. I can reproduce this on my home Red Hat 7.3 system too.

It appears to be a timing problem because when I do strace -F -f on it the problem goes away. Everything seems to go through normally but then it exits with an exit code of 12, I think because the child receiver process is terminated with a SIGUSR2, which is signal 12, and because the bug that was preventing exit codes from being properly reported from children has now been fixed.

It's very hard to debug because it is a timing problem and because it happens after rprintf handling is already shut down in the child process. I suspected that maybe the catching of the SIGUSR2 signal was not getting inherited from its parent, but it doesn't help to re-set it in the child. Nevertheless, I'm not sure whether or not the sigusr2_handler function is getting called in the child.

I'd appreciate some help with this if anybody else thinks they can figure it out.

- Dave Dykstra

On Thu, Jan 09, 2003 at 05:09:07PM -0600, Dave Dykstra wrote:
> I'm working on trying to get rsync 2.5.6pre1 available for people to
> test more widely. I'm out of time for today, and I'm stuck on a problem
> that some machines on build.samba.org are showing on the 'chgrp' test.
> I can reproduce this on my home redhat 7.3 system too. It appears to be a
> timing problem because when I do strace -F -f on it the problem goes away.
> Everything seems to go through normally but then it exits with an exit
> code of 12, I think because the child receiver process is terminated with
> a SIGUSR2 which is signal 12 and because the bug that was preventing exit
> codes from being properly reported from children has now been fixed.
> It's very hard to debug because it is a timing problem and because it
> happens after rprintf handling is already shut down in the child process.
> I suspected that maybe the catching of the SIGUSR2 signal was not getting
> inherited from its parent, but it doesn't help to re-set it in the child.
> Nevertheless, I'm not sure whether or not the sigusr2_handler function
> is getting called in the child. I'd appreciate some help with this if
> anybody else thinks they can figure it out.

I haven't pinned it down, but the problem appears to have been introduced in revision 1.157 of main.c with the lost exit status patch. I was able to reproduce your error on the chgrp test, and backing out to 1.156 fixed it. That at least narrows it down.

--
________________________________________________________________
J.W. Schultz            Pegasystems Technologies
email address:          jw@pegasys.ws
Remember Cernan and Schmitt

On Thu, Jan 09, 2003 at 05:09:07PM -0600, Dave Dykstra wrote:
> It's very hard to debug because it is a timing problem and because it
> happens after rprintf handling is already shut down in the child process.

Fortunately fprintf(stderr, ...) always works, even in the child process. This is what I've been using to get some status on the problem.

> Everything seems to go through normally but then it exits with an exit
> code of 12, I think because the child receiver process is terminated with
> a SIGUSR2 which is signal 12 and because the bug that was preventing exit
> codes from being properly reported from children has now been fixed.

The value of SIGUSR2 is a red herring. The error is really RERR_STREAMIO, which is being returned by the whine_about_eof() routine. I haven't had time to figure out why this code is getting sent during the final phase of the life of the receiver yet, though. The receiver successfully kills the generator, gets its 0 status code, begins to return a 0 status code, and then it suddenly starts exit_cleanup() over again with the error 12 from the io.c code.

One thing I have discovered is that if I remove the two rprintf() calls from exit_cleanup() (changing them into fprintf(stderr) calls), I can't get the test to fail. My current theory is that the sender is closing down the socket, and if the receiver just happens to get past the two rprintf()s before this happens, then all is well. If not, it gets an error (since something must be trying to flush during the exit_cleanup(0) processing) and switches to an exit of 12 (RERR_STREAMIO).

I'll finish debugging this later if no one else gets to it first.

..wayne..

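The distinction drawn above (an exit code of 12 versus termination by signal number 12) can be checked from the parent with the standard wait-status macros. The following is only a minimal sketch and is not taken from rsync: the struct child_end type and the run_child() helper are hypothetical names made up for illustration. It forks a child that either exits with a given code or kills itself with a given signal, then reports which of the two the parent observed.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical record (not an rsync type): how a child ended. */
struct child_end {
    int exited;    /* nonzero if the child called exit()/_exit() */
    int code;      /* exit code when exited is set, else 0 */
    int signaled;  /* nonzero if a signal terminated the child */
    int sig;       /* terminating signal when signaled is set, else 0 */
};

/*
 * Hypothetical helper: fork a child that exits with `code` when
 * sig == 0, or terminates itself with `sig` otherwise, and report
 * what the parent sees in the wait status.
 */
static struct child_end run_child(int code, int sig)
{
    struct child_end r = {0, 0, 0, 0};
    int status = 0;
    pid_t pid = fork();

    assert(pid >= 0);
    if (pid == 0) {
        if (sig != 0) {
            signal(sig, SIG_DFL);  /* make sure the default action applies */
            raise(sig);
        }
        _exit(code);
    }
    if (waitpid(pid, &status, 0) != pid)
        return r;
    if (WIFEXITED(status)) {
        r.exited = 1;
        r.code = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        r.signaled = 1;
        r.sig = WTERMSIG(status);
    }
    return r;
}
```

Calling run_child(12, 0) yields a result with exited set and code 12, while run_child(0, SIGUSR2) yields one with signaled set and sig equal to SIGUSR2. The two cases are reported through different macros, so a parent that checks WIFEXITED() first can tell them apart even on systems where SIGUSR2 happens to have the value 12.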
On Thu, Jan 09, 2003 at 05:09:07PM -0600, Dave Dykstra wrote:
> I'm stuck on a problem that some machines on build.samba.org are
> showing on the 'chgrp' test.

I've checked in a fix for this bug. Here's what I discovered:

The reason only the chgrp test failed is that it is the only test that uses -vvv instead of just -vv. This is significant because -vvv is needed for the receiver to try to do IO after killing off the generator. This IO would sometimes fail if the generator died before it finished.

More specific details:

(Background) The receiver forks off the generator with an error pipe open from the generator to the receiver. The receiver notes this error-receiving fd so that it gets monitored on all IO (and thus remains unblocked).

(The bug) When the receiver signals the generator to die, it was not clearing io_error_fd, and thus if it did any IO that just happened to occur after the generator closed its end of the pipe, the IO would notice the EOF on io_error_fd and cause the receiver to exit with error 12. Now that rsync properly notices the return value of the receiver process, this causes the sender to also exit with an error.

(The fix) Just call io_set_error_fd(-1) prior to signaling the generator to die.

I've appended the patch I just checked in.

..wayne..

---8<------8<------8<------8<---cut here--->8------>8------>8------>8---
Index: main.c
--- main.c	9 Jan 2003 19:04:06 -0000	1.157
+++ main.c	10 Jan 2003 08:28:39 -0000
@@ -461,6 +461,7 @@
 	}
 
 	io_flush();
+	io_set_error_fd(-1);
 	kill(pid, SIGUSR2);
 	wait_process(pid, &status);
 	return status;
---8<------8<------8<------8<---cut here--->8------>8------>8------>8---

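The mechanism described in this message can be reproduced in isolation. What follows is a rough sketch under stated assumptions, not rsync's io.c: error_fd, set_error_fd(), check_error_fd() and late_io_after_child_exit() are hypothetical stand-ins for io_error_fd, io_set_error_fd() and the per-IO monitoring, and the constant 12 simply mirrors the RERR_STREAMIO value mentioned in the thread. The sketch waits for the child before doing the late IO so that the outcome is deterministic, whereas in the real bug the ordering depended on timing.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SKETCH_RERR_STREAMIO 12   /* mirrors the value quoted in the thread */

static int error_fd = -1;         /* hypothetical stand-in for io_error_fd */

static void set_error_fd(int fd)
{
    error_fd = fd;
}

/*
 * Hypothetical per-IO check: poll the monitored pipe without blocking.
 * Returns 0 if nothing is wrong, or 12 if the pipe has reached EOF.
 */
static int check_error_fd(void)
{
    fd_set rfds;
    struct timeval tv = {0, 0};
    char buf[64];

    if (error_fd < 0)
        return 0;                 /* monitoring disabled */
    FD_ZERO(&rfds);
    FD_SET(error_fd, &rfds);
    if (select(error_fd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return 0;                 /* nothing pending on the pipe */
    if (read(error_fd, buf, sizeof buf) == 0)
        return SKETCH_RERR_STREAMIO;  /* writer is gone: EOF */
    return 0;
}

/*
 * Fork a child that holds the write end of an error pipe and exits,
 * then do one more "IO" in the parent. When clear_first is nonzero the
 * parent stops monitoring the pipe before the child goes away.
 */
static int late_io_after_child_exit(int clear_first)
{
    int fds[2], status = 0, rc;
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(fds[0]);
        close(fds[1]);            /* child closes its end of the pipe */
        _exit(0);
    }
    close(fds[1]);                /* parent keeps only the read end */
    set_error_fd(fds[0]);
    if (clear_first)
        set_error_fd(-1);         /* the idea behind the one-line fix */
    assert(waitpid(pid, &status, 0) == pid);
    rc = check_error_fd();        /* IO that happens after the child died */
    close(fds[0]);
    set_error_fd(-1);
    return rc;
}
```

In this sketch late_io_after_child_exit(0) returns 12, because the late IO still polls a pipe whose writer has exited, and late_io_after_child_exit(1) returns 0, because the descriptor was cleared first. That ordering is the same one the patch establishes by calling io_set_error_fd(-1) before kill().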
On Fri, Jan 10, 2003 at 12:46:13AM -0800, Wayne Davison wrote:
> This is significant because -vvv is needed for the receiver to try to
> do IO after killing off the generator. This IO would sometimes fail
> if the generator died before [the IO] finished.

For the pedantic: I mixed up the generator and the receiver throughout my description of the problem. It is the generator that forks off (and later kills) the receiver, not the other way around. Fortunately this slip of terminology in my email does not affect the patch at all.

..wayne..

Dave Dykstra [mailto:dwd@drdykstra.us] wrote:
> 3. The Stratus VOS port is failing all 3 daemon tests in code
> that is used just for testing, saying it can't create the test
> socket. I don't know if there's a corresponding problem in the
> corresponding non-test code.

The socketpair_tcp calls fail because we do not implement non-blocking connects. This is a known bug in VOS (stcp-1178) for which no fix is currently available, and there is no way to work around it in the socket.c code. I have updated the comment that appears in the output log to explain this.

Thanks
PG
--
Paul Green, Senior Technical Consultant, Stratus Technologies.
Day: +1 978-461-7557; FAX: +1 978-461-3610
Speaking from Stratus not for Stratus
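
To show why a missing non-blocking connect is hard to work around, here is a minimal sketch of building a connected TCP pair on the loopback interface from within a single process. It is not rsync's socketpair_tcp(); the name tcp_pair_sketch() is hypothetical, and the sketch only assumes that the real routine relies on a similar pattern. The connect has to be started in non-blocking mode because the same process must go on to call accept(), so a platform without non-blocking connects cannot follow this sequence.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill sv[0] and sv[1] with the two ends of a
 * TCP connection over 127.0.0.1. Returns 0 on success, -1 on failure.
 */
static int tcp_pair_sketch(int sv[2])
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int listener = -1, c = -1, s = -1, flags, err = 0;
    socklen_t errlen = sizeof err;
    struct timeval tv = {5, 0};
    fd_set wfds;

    assert(sv != NULL);
    listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0)
        goto fail;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                         /* let the kernel pick a port */
    if (bind(listener, (struct sockaddr *)&addr, sizeof addr) < 0
        || listen(listener, 1) < 0
        || getsockname(listener, (struct sockaddr *)&addr, &len) < 0)
        goto fail;

    c = socket(AF_INET, SOCK_STREAM, 0);
    if (c < 0)
        goto fail;
    flags = fcntl(c, F_GETFL, 0);
    if (flags < 0 || fcntl(c, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;
    /* Start the connect without blocking; EINPROGRESS is expected. */
    if (connect(c, (struct sockaddr *)&addr, sizeof addr) < 0
        && errno != EINPROGRESS)
        goto fail;

    s = accept(listener, NULL, NULL);          /* same process accepts */
    if (s < 0)
        goto fail;

    /* Wait for the connect to finish, then check how it ended. */
    FD_ZERO(&wfds);
    FD_SET(c, &wfds);
    if (select(c + 1, NULL, &wfds, NULL, &tv) <= 0)
        goto fail;
    if (getsockopt(c, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err != 0)
        goto fail;
    if (fcntl(c, F_SETFL, flags) < 0)          /* back to blocking mode */
        goto fail;

    close(listener);
    sv[0] = s;
    sv[1] = c;
    return 0;

fail:
    if (listener >= 0) close(listener);
    if (c >= 0) close(c);
    if (s >= 0) close(s);
    return -1;
}
```

On a system that supports this sequence, a byte written to sv[1] can be read back from sv[0]. The sketch depends on a working loopback interface and on connect() returning EINPROGRESS (or succeeding immediately) for a non-blocking socket.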