I believe I have found the cause of the unexplained error (code ??) at main.c(line #). In the version I'm running, 2.5.2 (obtained from the Free Software Foundation), the line number is 576. The root cause appears to be a race condition in the handling of child process termination.

If the SIGCHLD signal handler runs, as the result of a child terminating, before the wait_process procedure runs, the handler reaps the child first. The status of pid (as passed to wait_process) is then no longer available, and the waitpid call in wait_process fails with an ECHILD error. If, however, wait_process executes first, it successfully obtains the exit status of pid, and the sigchld_handler can then execute, to eliminate zombies, with no adverse effects. Since signal handlers execute asynchronously, there is no way to predict when, if at all, a process will encounter this problem.

I am providing below proposed new code that should resolve the problem. I can certainly make these changes within my own environment, but since I would like to remain consistent with the rsync project, I would like to hear from someone regarding incorporation of these changes into rsync, or an alternative (official) method of fixing this problem. I can be reached at drstaples@beckman.com; drstaples@drstaples.com; or drstapl@empirenet.com

Sincerely,
David R. Staples

--------------------------------------------------------------------------------
Proposed new code in main.c

    typedef struct {
        int pid;
        int status;
    } pid_status;

    pid_status pid_stat_table[10];

    static RETSIGTYPE sigchld_handler(int val)
    {
    #ifdef WNOHANG
        int indx;
        int pid;
        int status;

        /* Reap every exited child and record its status in the   */
        /* first free slot. Only real pids (> 0) are recorded.    */
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            for (indx = 0; indx < 10; indx++) {
                if (pid_stat_table[indx].pid == 0) {
                    pid_stat_table[indx].pid = pid;
                    pid_stat_table[indx].status = status;
                    break;
                }
            }
        }
    #endif
    }

    void wait_process(pid_t pid, int *status)
    {
        int waited_pid;
        int indx;

        do {
            waited_pid = waitpid(pid, status, WNOHANG);
            if (waited_pid == 0) {
                msleep(20);
                ioflush();
            }
        } while (waited_pid == 0);

        if ((waited_pid == -1) && (errno == ECHILD)) {
            /* status of requested child no longer available. Check */
            /* to see if it was processed by the sigchld_handler.   */
            for (indx = 0; indx < 10; indx++) {
                if (pid == pid_stat_table[indx].pid) {
                    *status = pid_stat_table[indx].status;
                    break;
                }
            }
        }

        *status = WEXITSTATUS(*status);
    }
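For readers who want to see the ordering problem in isolation, here is a minimal standalone sketch. It is not rsync code: the names on_sigchld, take_reaped, wait_for_child, demo_exit_code and MAX_REAPED are all hypothetical, and it assumes a POSIX system. It deliberately forces the bad ordering (the handler reaps the child before the explicit wait), then shows waitpid failing with ECHILD and the status being recovered from a table filled by the handler, which is the idea behind the proposal above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_REAPED 10  /* hypothetical table size, mirroring the proposal */

struct reaped { pid_t pid; int status; };
static volatile struct reaped reaped_table[MAX_REAPED];

/* SIGCHLD handler: reap every exited child and remember its raw status. */
static void on_sigchld(int signo)
{
    int saved_errno = errno;  /* waitpid may clobber errno */
    pid_t pid;
    int status;

    (void)signo;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        for (int i = 0; i < MAX_REAPED; i++) {
            if (reaped_table[i].pid == 0) {
                reaped_table[i].status = status;
                reaped_table[i].pid = pid;
                break;
            }
        }
    }
    errno = saved_errno;
}

/* Find pid in the table; optionally copy its status and free the slot.
 * The caller must have SIGCHLD blocked. Returns 0 if found, -1 if not. */
static int take_reaped(pid_t pid, int *status, int release)
{
    for (int i = 0; i < MAX_REAPED; i++) {
        if (reaped_table[i].pid == pid) {
            if (status) *status = reaped_table[i].status;
            if (release) reaped_table[i].pid = 0;
            return 0;
        }
    }
    return -1;
}

/* Wait for pid; if the handler got there first (ECHILD), fall back to
 * the table. Stores the raw wait status; returns 0 on success. */
static int wait_for_child(pid_t pid, int *status)
{
    sigset_t block, old;
    pid_t r;
    int rc;

    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);
    if (r == pid) return 0;
    if (errno != ECHILD) return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);
    rc = take_reaped(pid, status, 1);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return rc;
}

/* Fork a child that exits with `code`, let the handler reap it first,
 * then recover the exit code. Returns the exit code, or -1 on failure. */
static int demo_exit_code(int code)
{
    struct sigaction sa, old_sa;
    sigset_t block, old, wait_mask;
    int status = 0, rc;

    assert(code >= 0 && code <= 255);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1) return -1;

    /* Block SIGCHLD so the wake-up cannot be lost before sigsuspend. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);
    wait_mask = old;
    sigdelset(&wait_mask, SIGCHLD);

    pid_t pid = fork();
    if (pid == 0) _exit(code);
    if (pid > 0) {
        /* Force the problematic ordering: handler runs before the wait. */
        while (take_reaped(pid, NULL, 0) == -1)
            sigsuspend(&wait_mask);
    }
    sigprocmask(SIG_SETMASK, &old, NULL);

    rc = (pid > 0) ? wait_for_child(pid, &status) : -1;
    sigaction(SIGCHLD, &old_sa, NULL);
    return (rc == 0 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Calling demo_exit_code(7) from a small main should return 7 even though the handler has already reaped the child. Two choices in this sketch differ from the proposal and are assumptions of mine: the table is only read with SIGCHLD blocked, so the handler cannot modify it mid-lookup, and the raw status is stored so that WEXITSTATUS is applied exactly once by the caller.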
To start with, use the carriage return and don't send HTML.

This sounds like a problem that was fixed a few months ago. You might try searching the archives. Upgrade to current (2.5.5) or the CVS tree.

And please send any changes as diff -u against up-to-date CVS (see diff(1) and patch(1)), encoded as text/plain or flat ASCII. With what you sent, it is unclear what you changed, and applying the changes requires much work.

On Sun, Sep 01, 2002 at 05:12:32PM -0700, David R. Staples wrote: (reformatted)
> I believe I have found the cause of the unexplained error
> (code ??) at main.c(line #). In the version I'm running
> 2.5.2 (obtained from the Free Software Foundation) the line
> number is 576. It appears the root cause is related to a
> race condition associated with the termination of child
> processes.
>
> If the signal handler for SIGCHLD is executed, as the
> result of a child termination, before the wait_process
> procedure is executed, the status of pid (as passed to
> wait_process) will not be available and the waitpid call in
> wait_process will fail with an ECHILD error. If however,
> wait_process executes first, it will successfully obtain the
> exit status of pid and the sigchld_handler can execute, to
> eliminate zombies, with no adverse effects. Since signal
> handlers execute asynchronously there is no way to predict
> when, if at all, a process will encounter this problem.
>
> I am providing below proposed new code that should resolve
> the problem. Certainly I can make these changes within my
> own environment, but since I would like to remain consistent
> with the rsync project, I would like to hear from someone
> regarding incorporation of these changes into rsync or an
> alternative (official) method to fixing this problem. I can
> be reached at drstaples@beckman.com;
> drstaples@drstaples.com; or drstapl@empirenet.com
>
> Sincerely,
> David R. Staples
>
> --------------------------------------------------------------------------------
> Proposed new code in main.c
[snip]
JW,

Sorry for p*ssing you off; that was not my intention.

I looked at the archives and saw no resolution to this specific problem in any of the rsync-2.5.3-NEWS, rsync-2.5.4-NEWS or rsync-2.5.5-NEWS files. I am somewhat new to the *NIX environment and don't yet understand all the tools or protocols (CVS, diff(1), patch(1)) or how they work. However, I did just download the 2.5.5 source from ftp.samba.org/ftp/rsync (I am not sure whether this is where I should be getting it). In any event, the source that exists there, with regard to handling terminating children, is no different from the 2.5.2 version.

I will look into how to properly format the changes for diff(1) and/or patch(1). I would appreciate it if someone would let me know how and where I should submit the changes for incorporation into the product.

Dave

"jw schultz" <jw@pegasys.ws>
To: "David R. Staples" <drstapl@empirenet.com>
cc: rsync@samba.org, drstaples@beckman.com
Subject: Re: rsync error: unexplained error
09/01/02 05:34 PM

To start with, use the carriage return and don't send HTML.

This sounds like a problem that was fixed a few months ago. You might try searching the archives. Upgrade to current (2.5.5) or the CVS tree.

And please send any changes as diff -u against up-to-date CVS (see diff(1) and patch(1)), encoded as text/plain or flat ASCII. With what you sent, it is unclear what you changed, and applying the changes requires much work.