andrew.marlow at uk.bnpparibas.com
2010-Jun-02  08:40 UTC
rsync 3.0.7 network errors on MS-Windows
I am experiencing intermittent network failures with rsync 3.0.7 built using cygwin for Windows XP (SP2). I am using GCC v4.4.2 and the latest version of cygwin. The rsync error log shows things like:

    rsync: writefd_unbuffered failed to write 4092 bytes to socket [generator]: Connection reset by peer (104)
    rsync: read error: Connection reset by peer (104)
    rsync error: error in rsync protocol data stream (code 12) at io.c(1530) [generator=3.0.7]
    rsync error: error in rsync protocol data stream (code 12) at io.c(760) [receiver=3.0.7]

Googling, I see that these problems were put down to the way sockets are cleaned up on Windows, and that a fix was put in place in cleanup.c, in close_all(). But the fix is surrounded by conditional compilation:

    #ifdef SHUTDOWN_ALL_SOCKETS
    :
    :
    #endif

Can someone please explain why that is? Shouldn't the fix just be there always, regardless of operating system?

Regards,
Andrew Marlow
andrew.marlow at uk.bnpparibas.com wrote:
> I am experiencing intermittent network failures on rsync 3.0.7 built
> using cygwin for Windows-XP (SP2). I am using GCC v4.4.2 and the
> latest version of cygwin.
> The rsync error log indicates things like:
> rsync: writefd_unbuffered failed to write 4092 bytes to socket
> [generator]:
> Connection reset by peer (104)
> rsync: read error: Connection reset by peer (104)
> rsync error: error in rsync protocol data stream (code 12) at
> io.c(1530) [generator=3.0.7]
> rsync error: error in rsync protocol data stream (code 12) at
> io.c(760) [receiver=3.0.7]
> Googling I see that these problems were put down to the way sockets are
> cleaned up in Windows and a fix put in place in cleanup.c, in
> close_all(). But the fix is surrounded by conditional compilation:-
> #ifdef SHUTDOWN_ALL_SOCKETS
> :
> :
> #endif
> Can someone please explain why that is? Shouldn't the fix just be
> there always, and regardless of which operating system?

It's not needed on most operating systems, as the comment there implies.

According to the notes copied below, SO_LINGER is off by default on unix sockets, which means close() gracefully sends the remaining data in the background rather than sending TCP RST. You can assume that program exit has the same effect as close().

If SO_LINGER is turned on with a zero timeout, the notes below say TCP RST is sent on close, which is much like what the comment for SHUTDOWN_ALL_SOCKETS says is happening on Windows without SO_LINGER.

Presumably Windows sockets, or at least some version of them (there are several versions of Winsock), behave differently from unix sockets in this area. It wouldn't be surprising: historically Winsock ran inside the process, not the kernel, so an exiting process couldn't implement the unix graceful close behaviour, and maybe they kept that behaviour the same in later versions.

That said, I still don't see why SHUTDOWN_ALL_SOCKETS would fix it.
Calling shutdown(fd,2) closes it in both directions, and at least with usual unix sockets, that would trigger TCP RST anyway if the other end sends any data after the shutdown. Which it seems to be doing: "writefd_unbuffered failed to write 4092 bytes to socket" implies the other end has called close(), shutdown(fd,1) or shutdown(fd,2), and that data was then sent to it which it couldn't accept, so it sent back TCP RST anyway.

If rsync is doing that in normal operation, that ought to be a problem on unix just as much as Windows, and SHUTDOWN_ALL_SOCKETS ought to be insufficient to prevent the reset. Which suggests to me that "writefd_unbuffered failed to write 4092 bytes to socket" is a symptom of a different problem.

Here are the notes I referred to above, which explain SO_LINGER's behaviour:

Unix Socket FAQ
http://www.developerweb.net/forum/archive/index.php/t-2982.html

4.6 - What exactly does SO_LINGER do?
Contributed by Cyrus Patel

SO_LINGER affects the behaviour of the close() operation as described below. No socket operation other than close() is affected by SO_LINGER.

The following description of the effect of SO_LINGER has been culled from the setsockopt() and close() man pages for several systems, but may still not be applicable to your system. The differences in implementation range from not supporting SO_LINGER at all; or only supporting it partially; or having to deal with the "peculiarities" in a particular implementation (see portability notes at end).

Moreover, the purpose of SO_LINGER is very, very specific and only a tiny minority of socket applications actually need it. Unless you are extremely familiar with the intricacies of TCP and the BSD socket API, you could very easily end up using SO_LINGER in a way for which it was not designed.

The effect of a setsockopt(..., SO_LINGER,...) depends on what the values in the linger structure (the fourth parameter passed to setsockopt()) are:

Case 1: linger->l_onoff is zero (linger->l_linger has no meaning):
This is the default. On close(), the underlying stack attempts to gracefully shutdown the connection after ensuring all unsent data is sent. In the case of connection-oriented protocols such as TCP, the stack also ensures that sent data is acknowledged by the peer. The stack will perform the above-mentioned graceful shutdown in the background (after the call to close() returns), regardless of whether the socket is blocking or non-blocking.

Case 2: linger->l_onoff is non-zero and linger->l_linger is zero:
A close() returns immediately. The underlying stack discards any unsent data, and, in the case of connection-oriented protocols such as TCP, sends a RST (reset) to the peer (this is termed a hard or abortive close). All subsequent attempts by the peer's application to read()/recv() data will result in an ECONNRESET.

Case 3: linger->l_onoff is non-zero and linger->l_linger is non-zero:
A close() will either block (if a blocking socket) or fail with EWOULDBLOCK (if non-blocking) until a graceful shutdown completes or the time specified in linger->l_linger elapses (time-out). Upon time-out the stack behaves as in case 2 above.

---------------------------------------------------------------

Portability note 1: Some implementations of the BSD socket API do not implement SO_LINGER at all. On such systems, applying SO_LINGER either fails with EINVAL or is (silently) ignored. Having SO_LINGER defined in the headers is no guarantee that SO_LINGER is actually implemented.

Portability note 2: Since the BSD documentation on SO_LINGER is sparse and inadequate, it is not surprising to find the various implementations interpreting the effect of SO_LINGER differently.
For instance, the effect of SO_LINGER on non-blocking sockets is not mentioned at all in BSD documentation, and is consequently treated differently on different platforms. Taking case 3 for example: Some implementations behave as described above. With others, a non-blocking socket close() succeeds immediately, leaving the rest to a background process. Others ignore non-blocking'ness and behave as if the socket were blocking. Yet others behave as if SO_LINGER wasn't in effect [as if case 1, the default, was in effect], or ignore linger->l_linger [case 3 is treated as case 2]. Given the lack of adequate documentation, such differences are not (by themselves) indicative of an "incomplete" or "broken" implementation. They are simply different, not incorrect.

Portability note 3: Some implementations of the BSD socket API do not implement SO_LINGER completely. On such systems, the value of linger->l_linger is ignored (always treated as if it were zero).

Technical/Developer note: SO_LINGER does (should) not affect a stack's implementation of TIME_WAIT. In any event, SO_LINGER is not the way to get around TIME_WAIT. If an application expects to open and close many TCP sockets in quick succession, it should be written to use only a fixed number and/or range of ports, and apply SO_REUSEPORT to sockets that use those ports.

Related note: Many BSD sockets implementations also support a SO_DONTLINGER socket option. This socket option has the exact opposite meaning of SO_LINGER, and the two are treated (after inverting the value of linger->l_onoff) as equivalent. In other words, SO_LINGER with a zero linger->l_onoff is the same as SO_DONTLINGER with a non-zero linger->l_onoff, and vice versa.

-- Jamie
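The three SO_LINGER cases in the FAQ text can be tried out with a small loopback experiment. What follows is a minimal sketch under stated assumptions, not anything taken from rsync: set_linger and observe_close are hypothetical helper names, the test uses a TCP connection over 127.0.0.1, and, as the portability notes warn, the result may differ between stacks.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: apply one of the three cases to a socket.
 *   onoff == 0             case 1: default, graceful close in background
 *   onoff != 0, secs == 0  case 2: abortive close (RST for TCP)
 *   onoff != 0, secs  > 0  case 3: close() may wait up to secs seconds */
int set_linger(int fd, int onoff, int secs)
{
    struct linger lg;

    assert(secs >= 0);              /* l_linger is a time in seconds */
    lg.l_onoff = onoff;
    lg.l_linger = secs;
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}

/* Hypothetical helper: open a loopback TCP connection, close the accepted
 * end with the given linger settings, and report what the connecting end
 * then gets from recv(): 0 for end-of-file, otherwise the errno value.
 * Returns -1 if the connection could not be set up. */
int observe_close(int onoff, int secs)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    ssize_t n;
    char byte;
    int result = -1;
    int afd = -1;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */

    if (lfd < 0 || cfd < 0)
        goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(lfd, 1) != 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) != 0)
        goto out;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof addr) != 0)
        goto out;
    afd = accept(lfd, NULL, NULL);
    if (afd < 0 || set_linger(afd, onoff, secs) != 0)
        goto out;

    close(afd);                     /* the close() whose effect we observe */
    afd = -1;
    n = recv(cfd, &byte, 1, 0);     /* blocks until FIN or RST arrives */
    if (n == 0)
        result = 0;                 /* orderly end-of-file */
    else if (n < 0)
        result = errno;             /* e.g. ECONNRESET */
out:
    if (afd >= 0) close(afd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return result;
}

#ifdef DEMO_MAIN
#include <stdio.h>
int main(void)
{
    printf("default close -> %d\n", observe_close(0, 0));
    printf("linger {1, 0} -> %d\n", observe_close(1, 0));
    return 0;
}
#endif
```

Built with -DDEMO_MAIN on a typical Linux system, I would expect the default close to report end-of-file and the zero-timeout linger to report ECONNRESET, which is the "Connection reset by peer" seen in the original log. It says nothing about what Cygwin or Winsock do at process exit, which is the open question in this thread.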