I believe I have found the cause of the unexplained error (code ??) at
main.c(line #). In the version I'm running, 2.5.2 (obtained from the Free
Software Foundation), the line number is 576. The root cause appears to be a
race condition in the handling of child process termination.

If the signal handler for SIGCHLD runs, as the result of a child terminating,
before wait_process is executed, the handler reaps the child, so the status of
pid (as passed to wait_process) is no longer available and the waitpid call in
wait_process fails with an ECHILD error. If, however, wait_process executes
first, it successfully obtains the exit status of pid, and sigchld_handler can
then execute, to eliminate zombies, with no adverse effects. Since signal
handlers execute asynchronously, there is no way to predict when, if at all, a
process will encounter this problem.
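The race can be reproduced outside rsync. The following is a minimal sketch, not rsync code: the names reap_children and demo_lost_status are hypothetical, and it assumes a POSIX system. It forces the "handler first" ordering by waiting until the SIGCHLD handler has reaped the child, and only then calls waitpid on that child's pid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler once it has reaped at least one child. */
static volatile sig_atomic_t child_reaped;

/* Hypothetical zombie-cleaning handler: reaps every exited child. */
static void reap_children(int sig)
{
    int saved_errno = errno;

    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        child_reaped = 1;
    errno = saved_errno;
}

/* Returns the errno of the failed waitpid(pid, ...) call, 0 if that
 * call unexpectedly succeeded, or -1 on a setup failure. */
int demo_lost_status(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    pid_t pid;
    int status, err = 0;

    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1)
        return -1;

    /* Block SIGCHLD so the flag test and sigsuspend cannot race. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGCHLD);

    child_reaped = 0;
    pid = fork();
    if (pid == 0)
        _exit(7);                     /* child: exit immediately */
    if (pid > 0) {
        while (!child_reaped)
            sigsuspend(&wait_mask);   /* let the handler win the race */
        if (waitpid(pid, &status, 0) == -1)
            err = errno;              /* status was already collected */
    } else {
        err = -1;
    }

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    return err;
}
```

Calling demo_lost_status() should return ECHILD: the child's exit status of 7 has already been consumed by the handler, which is the same ordering that makes wait_process fail.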

I am providing below proposed new code that should resolve the problem.
I can certainly make these changes within my own environment, but since I would
like to remain consistent with the rsync project, I would like to hear from
someone regarding incorporation of these changes into rsync, or an alternative
(official) method of fixing this problem.

I can be reached at drstaples@beckman.com; drstaples@drstaples.com; or
drstapl@empirenet.com
Sincerely,
David R. Staples
--------------------------------------------------------------------------------
Proposed new code in main.c
typedef struct {
    pid_t pid;
    int status;
} pid_status;

/* Children reaped by sigchld_handler; a pid of 0 marks a free slot. */
pid_status pid_stat_table[10];

static RETSIGTYPE sigchld_handler(int val) {
#ifdef WNOHANG
    int indx;
    pid_t pid;
    int status;

    /* Reap every exited child and save its status for wait_process. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        for (indx = 0; indx < 10; indx++) {
            if (pid_stat_table[indx].pid == 0) {
                pid_stat_table[indx].pid = pid;
                pid_stat_table[indx].status = status;
                break;
            }
        }
    }
#endif
}

void wait_process(pid_t pid, int *status)
{
    pid_t waited_pid;
    int indx;

    do {
        waited_pid = waitpid(pid, status, WNOHANG);
        if (waited_pid == 0) {
            msleep(20);
            io_flush();
        }
    } while (waited_pid == 0);

    if ((waited_pid == -1) && (errno == ECHILD)) {
        /* status of requested child no longer available. Check */
        /* to see if it was processed by the sigchld_handler. */
        for (indx = 0; indx < 10; indx++) {
            if (pid == pid_stat_table[indx].pid) {
                *status = pid_stat_table[indx].status;
                break;
            }
        }
    }
    *status = WEXITSTATUS(*status);
}
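To try the table technique outside the rsync tree, here is a self-contained sketch under stated assumptions: a POSIX system, and hypothetical names (saved, on_sigchld, wait_with_table, demo_table_wait) that do not come from rsync. The handler records each reaped pid and status, and the waiting code falls back to that table when waitpid reports ECHILD.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_SAVED 10   /* hypothetical size, mirroring the 10 used above */

struct saved_status { pid_t pid; int status; };
static volatile struct saved_status saved[MAX_SAVED];  /* pid 0 = free */
static volatile sig_atomic_t handler_ran;

/* Reap all exited children and remember their statuses. */
static void on_sigchld(int sig)
{
    int saved_errno = errno, status, i;
    pid_t pid;

    (void)sig;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        for (i = 0; i < MAX_SAVED; i++) {
            if (saved[i].pid == 0) {
                saved[i].status = status;
                saved[i].pid = pid;
                break;
            }
        }
        handler_ran = 1;
    }
    errno = saved_errno;
}

/* Wait for pid; on ECHILD, look the status up in the table.
 * Returns 0 and fills *status on success, -1 otherwise. */
int wait_with_table(pid_t pid, int *status)
{
    pid_t w;
    int i;

    do {
        w = waitpid(pid, status, 0);
    } while (w == -1 && errno == EINTR);
    if (w == pid)
        return 0;
    if (errno != ECHILD)
        return -1;
    for (i = 0; i < MAX_SAVED; i++) {
        if (saved[i].pid == pid) {
            *status = saved[i].status;
            saved[i].pid = 0;          /* release the slot */
            return 0;
        }
    }
    return -1;
}

/* Forces the handler to reap the child first, then recovers its exit
 * code through the table. Returns the exit code, or -1 on failure. */
int demo_table_wait(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    pid_t pid;
    int status = 0, rc = -1;

    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1)
        return -1;
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGCHLD);

    handler_ran = 0;
    pid = fork();
    if (pid == 0)
        _exit(7);                      /* child: exit with a known code */
    if (pid > 0) {
        while (!handler_ran)
            sigsuspend(&wait_mask);    /* handler reaps the child here */
        rc = wait_with_table(pid, &status);
    }
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    if (rc == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In this sketch demo_table_wait() should return 7 even though the handler collected the child first. A fixed-size table is a design choice carried over from the proposal: if more than MAX_SAVED children are reaped before anyone looks them up, later statuses are silently dropped.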
To start with, use the carriage return and don't send HTML.

This sounds like a problem that was fixed a few months ago.
You might try searching the archives. Upgrade to current
(2.5.5) or the CVS tree.

And please send any changes as diff -u against up-to-date
CVS (see diff(1) and patch(1)) encoded as text/plain or flat
ASCII. With what you sent it is unclear what you changed,
and applying the changes requires much work.

On Sun, Sep 01, 2002 at 05:12:32PM -0700, David R. Staples wrote: (reformatted)
> I believe I have found the cause of the unexplained error
> (code ??) at main.c(line #). In the version I'm running
> 2.5.2 (obtained from the Free Software Foundation) the line
> number is 576. It appears the root cause is related to a
> race condition associated with the termination of child
> processes.
>
> If the signal handler for SIGCHLD is executed, as the
> result of a child termination, before the wait_process
> procedure is executed, the status of pid (as passed to
> wait_process) will not be available and the waitpid call in
> wait_process will fail with an ECHILD error. If however,
> wait_process executes first, it will successfully obtain the
> exit status of pid and the sigchld_handler can execute, to
> eliminate zombies, with no adverse effects. Since signal
> handlers execute asynchronously there is no way to predict
> when, if at all, a process will encounter this problem.
>
> I am providing below proposed new code that should resolve
> the problem. Certainly I can make these changes within my
> own environment, but since I would like to remain consistent
> with the rsync project, I would like to hear from someone
> regarding incorporation of these changes into rsync or an
> alternative (official) method to fixing this problem. I can
> be reached at drstaples@beckman.com;
> drstaples@drstaples.com; or drstapl@empirenet.com
>
> Sincerely,
> David R. Staples
>
> --------------------------------------------------------------------------------
> Proposed new code in main.c
[snip]

JW,
Sorry for p*ssing you off; that was not my intention.
I looked at the archives and saw no resolution to this specific problem in
any of the rsync-2.5.3-NEWS, rsync-2.5.4-NEWS or rsync-2.5.5-NEWS files.
I am somewhat new to the *NIX environment and don't yet understand all
the tools or protocols (CVS, diff(1), patch(1)) or how they work.
However, I did just download the 2.5.5 source from ftp.samba.org/ftp/rsync
(I am not sure whether this is where I should be getting it). In any event,
the source that exists there is, with regard to handling terminating children,
no different from the 2.5.2 version.
I will look into how to properly format the changes for diff(1) and/or patch(1).
I would appreciate it if someone would let me know how and where I should
submit the changes for incorporation into the product.
Dave

"jw schultz" <jw@pegasys.ws>
To: "David R. Staples" <drstapl@empirenet.com>
cc: rsync@samba.org, drstaples@beckman.com
Subject: Re: rsync error: unexplained error
09/01/02 05:34 PM

To start with, use the carriage return and don't send HTML.
This sounds like a problem that was fixed a few months ago.
You might try searching the archives. Upgrade to current
(2.5.5) or the CVS tree.
And please send any changes as diff -u against up-to-date
CVS (see diff(1) and patch(1)) encoded as text/plain or flat
ASCII. With what you sent it is unclear what you changed,
and applying the changes requires much work.