Hi,
In http://sources.redhat.com/ml/cygwin/2002-09/msg01155.html, I noted that
the often-observed hangs of rsync under Cygwin were assuaged by a call to
msleep().
After upgrading my Cygwin environment to rsync 2.5.6, I'm seeing these
hangs again. That is not surprising, given that a CVS entry for main.c
notes that this kludge was not harmless:
    Revision 1.162, Tue Jan 28 05:05:53 2003 UTC (4 months, 4 weeks ago) by dwd
    Remove the Cygwin msleep(100) before the generator kills the receiver,
    because it caused the testsuite/unsafe-links test to hang.
So it seems sensible to attempt something a bit more elegant.
And the first question is why kill/signals are being used
here at all.
I think the illustrative patch below effects an equivalent synchronization,
but does so by queuing a byte into a pipe rather than sending a signal.
Of course, since it's not currently done this way, I may be overlooking
something obvious. I can't quite see what though, since in the event
that an error occurs then exit_cleanup is available to send SIGUSR1
with extreme prejudice; but if the protocol in fact concludes cleanly
then there really should be no need for an asynchronous notification?
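Stripped of the rsync specifics, the handshake can be sketched roughly as
below. This is only an illustration under assumptions, not the patch itself:
release_child_via_pipe() is a hypothetical name that does not exist in rsync,
and the child here does no real work before it blocks.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of rsync: fork a child that blocks on a
 * pipe until the parent writes one byte, then exits.  Returns the child's
 * exit code, or -1 on failure. */
int release_child_via_pipe(void)
{
	int cleanup_pipe[2];
	int status = 0;
	ssize_t wrote;
	pid_t pid;

	if (pipe(cleanup_pipe) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(cleanup_pipe[0]);
		close(cleanup_pipe[1]);
		return -1;
	}

	if (pid == 0) {
		char tmp;
		ssize_t n;

		close(cleanup_pipe[1]);		/* child only reads */
		do {
			n = read(cleanup_pipe[0], &tmp, 1);
		} while (n == -1 && errno == EINTR);
		/* 1 byte is a clean release; 0 (EOF) means the writer went away */
		_exit(n == 1 ? 0 : 1);
	}

	close(cleanup_pipe[0]);			/* parent only writes */
	wrote = write(cleanup_pipe[1], ".", 1);
	(void)wrote;	/* on failure the close below still delivers EOF */
	close(cleanup_pipe[1]);

	/* Retry if a signal interrupts the wait. */
	while (waitpid(pid, &status, 0) == -1) {
		if (errno != EINTR)
			return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In this sketch nothing else holds the write end of the pipe, so a parent
that dies early leaves the child with EOF on read() rather than asleep
forever. It also assumes no SIGCHLD handler is installed that could reap
the child before waitpid() does.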
Comments sought; meanwhile I'll test the patch a bit...
Regards
Anthony
*** main.c.Orig	Fri Jun 27 15:21:22 2003
--- main.c	Fri Jun 27 15:30:09 2003
***************
*** 390,395 ****
--- 390,396 ----
  	int status=0;
  	int recv_pipe[2];
  	int error_pipe[2];
+ 	int cleanup_pipe[2];
  	extern int preserve_hard_links;
  	extern int delete_after;
  	extern int recurse;
***************
*** 416,426 ****
--- 417,435 ----
  		exit_cleanup(RERR_SOCKETIO);
  	}
    
+ 	if (pipe(cleanup_pipe) < 0) {
+ 		rprintf(FERROR,"cleanup pipe failed in do_recv\n");
+ 		exit_cleanup(RERR_SOCKETIO);
+ 	}
+   
  	io_flush();
  
  	if ((pid=do_fork()) == 0) {
+ 		char tmp;
+ 
  		close(recv_pipe[0]);
  		close(error_pipe[0]);
+ 		close(cleanup_pipe[1]);
  		if (f_in != f_out) close(f_out);
  
  		/* we can't let two processes write to the socket at one time */
***************
*** 436,450 ****
  		write_int(recv_pipe[1],1);
  		close(recv_pipe[1]);
  		io_flush();
! 		/* finally we go to sleep until our parent kills us
! 		   with a USR2 signal. We sleep for a short time as on
! 		   some OSes a signal won't interrupt a sleep! */
! 		while (msleep(20))
! 			;
  	}
  
  	close(recv_pipe[1]);
  	close(error_pipe[1]);
  	if (f_in != f_out) close(f_in);
  
  	io_start_buffering(f_out);
--- 445,465 ----
  		write_int(recv_pipe[1],1);
  		close(recv_pipe[1]);
  		io_flush();
! 		do {
! 			status = read(cleanup_pipe[0], &tmp, 1);
! 		} while (status == -1 && errno == EINTR);
! 		if (status != 1) {
! 			rprintf(FERROR,"cleanup read returned %d in do_recv\n", status);
! 			if (status == -1)
! 				rprintf(FERROR,"with errno %d (%s)\n", errno, strerror(errno));
! 			_exit(RERR_PARTIAL);
! 		}
! 		_exit(0);
  	}
  
  	close(recv_pipe[1]);
  	close(error_pipe[1]);
+ 	close(cleanup_pipe[0]);
  	if (f_in != f_out) close(f_in);
  
  	io_start_buffering(f_out);
***************
*** 462,469 ****
  	io_flush();
  
  	io_set_error_fd(-1);
! 	kill(pid, SIGUSR2);
! 	wait_process(pid, &status);
  	return status;
  }
  
--- 477,487 ----
  	io_flush();
  
  	io_set_error_fd(-1);
! 	write(cleanup_pipe[1], ".", 1);
! 	if (waitpid(pid, &status, 0) != pid) {
! 		rprintf(FERROR,"cleanup in do_recv failed\n");
! 		exit_cleanup(RERR_SOCKETIO);
! 	}
  	return status;
  }
  
***************
*** 867,878 ****
  	exit_cleanup(RERR_SIGNAL);
  }
  
- static RETSIGTYPE sigusr2_handler(int UNUSED(val)) {
- 	extern int log_got_error;
- 	if (log_got_error) _exit(RERR_PARTIAL);
- 	_exit(0);
- }
- 
  static RETSIGTYPE sigchld_handler(int UNUSED(val)) {
  #ifdef WNOHANG
  	int cnt, status;
--- 885,890 ----
***************
*** 964,970 ****
  	orig_argv = argv;
  
  	signal(SIGUSR1, sigusr1_handler);
- 	signal(SIGUSR2, sigusr2_handler);
  	signal(SIGCHLD, sigchld_handler);
  #ifdef MAINTAINER_MODE
  	signal(SIGSEGV, rsync_panic_handler);
--- 976,981 ----
--
Anthony Heading
This communication is for informational purposes only.  It is not intended as
an offer or solicitation for the purchase or sale of any financial instrument
or as an official confirmation of any transaction. All market prices, data
and other information are not warranted as to completeness or accuracy and
are subject to change without notice. Any comments or statements made herein
do not necessarily reflect those of J.P. Morgan Chase & Co., its
subsidiaries and affiliates.

Apparently this fixed the problem for Tillman, James.
Could you regenerate the patch with diff -u please?

On Fri, Jun 27, 2003 at 04:16:12PM +0900, Anthony Heading wrote:
> [...]

--
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

> -----Original Message-----
> From: jw schultz [mailto:jw@pegasys.ws]
> Sent: Wednesday, July 09, 2003 5:59 AM
> To: rsync@lists.samba.org
> Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
>
> > I can't quite place why but my instincts inform me that you
> > have latched onto something.  Some sort of one character
> > buffering error in the io libraries under cygwin.  Most
> > likely in the windos libs.
> >
> > Well, we have two reports of this fixing the rsync hang
> > problem when signals failed.  I'd like a little more testing
> > before mainlining it.
>
> Nope!  This is a no-go.  It intermittently produces
>
> 	error (10) -- error in socket IO
>
> on both network and local transfers.

I guess I'd better double check my processes to make sure that I'm getting a
satisfactory success rate on my own servers.  If I see any clues, I'll
report them here.  Any hope for a fix, or does this look like an inherent
problem in the method being used?

jpt

My sincerest apologies for the duplicate msgs from me that were sent to the
list this morning.  My email administrator must have done something quite
stupid to have all msgs I've sent in the last week go out again!

jpt

> -----Original Message-----
> From: Tillman, James
> Sent: Wednesday, July 09, 2003 6:48 AM
> To: rsync@lists.samba.org
> Subject: RE: PATCH/RFC: Another stab at the Cygwin hang problem
>
> [...]

> -----Original Message-----
> From: jw schultz [mailto:jw@pegasys.ws]
> Sent: Saturday, July 12, 2003 11:25 AM
> To: rsync@lists.samba.org
> Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
>
> [...]
> > Anyhow, just to let you know.  If you're happy tidying
> > up and refining the patch yourself, please go ahead.  If
> > you want me to do anything, or have any comments on
> > what I've done, I'd appreciate an email.  However I
> > will try to follow the rsync list for the next few
> > weeks at least.
>
> As i said earlier, i intuit you are on to something with
> this patch.  If you care to clean it up that would be good.
> I would rather someone experiencing the hangs do the fix.
> That tends to reduce the cycle times.

I'm willing to help test if someone sends improvements on Anthony's original
patch to the list.  The original has been working great for my own purposes
so far.  I realized when I started using it that I was being a little hasty,
but my own situation required quicker action than is usually recommended.
The risks were worth it, apparently.

What I'm most interested in seeing is a real fix for this hang problem
(Anthony's or someone else's) incorporated into an rsync release sometime in
the near future so that I don't have to retain the patch code and special
instructions for reinstalling my own running system.

jpt

Ah, I just found the patch that jw sent (email system locked it as potential
virus).  Will try to compile and test this week.  My own environment uses
only SSH push.

jpt

> -----Original Message-----
> From: jw schultz [mailto:jw@pegasys.ws]
> Sent: Saturday, July 12, 2003 6:53 AM
> To: rsync@lists.samba.org
> Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
>
> On Wed, Jul 09, 2003 at 06:47:35AM -0400, Tillman, James wrote:
> > [...]
> > Any hope for a fix, or does this look like an inherent
> > problem in the method being used?
>
> It looks like the method is fairly sound.  The problem seems
> to primarily be in dealing with the child termination.
>
>  	io_set_error_fd(-1);
> -	kill(pid, SIGUSR2);
> -	wait_process(pid, &status);
> +	write(cleanup_pipe[1], ".", 1);
> +	if (waitpid(pid, &status, 0) != pid) {
> +		rprintf(FERROR,"cleanup in do_recv failed\n");
> +		exit_cleanup(RERR_SOCKETIO);
> +	}
>  	return status;
>
> There is a huge window between the write() and the return of
> waitpid() that, depending on scheduling and signal delivery,
> allows the child pid to be reaped by the SIGCHLD handler.  That
> results in this waitpid() returning -1 with errno of ECHILD.
> EINTR would also be possible.  The timing dependencies
> account for the intermittency of the error.
>
> I've attached an altered patch.  I've only dealt with this
> one location which produced errors doing a ssh pull.  I
> haven't addressed the local transfer errors but i suspect
> that derived from this waitpid error.  Further testing will
> still be needed to ensure that ssh push and rsyncd usage are
> unbroken.  This really needs testing in cygwin which i don't
> have.  If it takes care of the cygwin hang then we can
> polish it.  There remains the issue of an error status
> when the only failure is termination.
>
> --
> ________________________________________________________________
> 	J.W. Schultz            Pegasystems Technologies
> 	email address:		jw@pegasys.ws
>
> 		Remember Cernan and Schmitt
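
The waitpid() race described in the quoted analysis can be illustrated
with a small sketch. jw's altered patch was an attachment and is not
reproduced in this thread, so this is not that patch: wait_child_tolerant()
is a hypothetical helper that simply retries on EINTR and reports ECHILD
as "already reaped elsewhere" instead of treating it as a socket I/O error.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper, not part of rsync.
 * Returns  0 if this call reaped pid (*status is then valid),
 *          1 if pid was already reaped, e.g. by a SIGCHLD handler,
 *         -1 on any other waitpid() failure. */
int wait_child_tolerant(pid_t pid, int *status)
{
	for (;;) {
		pid_t r = waitpid(pid, status, 0);

		if (r == pid)
			return 0;
		if (r == -1 && errno == EINTR)
			continue;	/* interrupted by a signal: retry */
		if (r == -1 && errno == ECHILD)
			return 1;	/* someone else collected the child */
		return -1;
	}
}
```

A caller that gets 1 back no longer has the child's exit status, which is
the remaining issue noted above: the status would have to be recorded by
whatever did reap the child.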