Jeff Davis
2007-Jan-31 19:39 UTC
send() returns error even though data is sent, TCP connection still alive
I am on FreeBSD 6.1 and I'm seeing write() return EHOSTDOWN while
keeping the connection alive.
I wrote a simple C client on the affected FreeBSD box to write a series
of integers to a server program on another machine. When the client's
write receives the EHOSTDOWN, the data it sent arrives on the server
program anyway. Moreover, when I write() again on the same socket, the
data goes through as if nothing ever happened without further errors.
The connection is not broken by the EHOSTDOWN, and the client never
knows the difference. In fact, if the application just ignores the error
from write() everything appears fine after that.
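
A minimal sketch of the kind of client described above, not the original test program. The helper name send_ints, the counter parameter, and the choice to send each integer as a 4-byte network-order value are all assumptions made for illustration. It counts EHOSTDOWN results from write() and carries on rather than treating them as fatal, and it deliberately does not resend, since in the behavior reported here the data reaches the server anyway.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: write the integers 0..count-1 to a connected
 * stream socket, one 4-byte network-order value per write().
 * EHOSTDOWN is counted and logged instead of aborting the loop.
 * Returns 0 on success, -1 on any other error or a short write.
 */
int send_ints(int fd, uint32_t count, unsigned *ehostdown_seen)
{
    assert(ehostdown_seen != NULL);
    *ehostdown_seen = 0;

    for (uint32_t i = 0; i < count; i++) {
        uint32_t wire = htonl(i);
        ssize_t n = write(fd, &wire, sizeof wire);

        if (n == (ssize_t)sizeof wire)
            continue;               /* normal case */

        if (n < 0 && errno == EHOSTDOWN) {
            /* Assumption taken from the report: the data was sent
             * despite the error, so do not resend this integer. */
            (*ehostdown_seen)++;
            fprintf(stderr, "write #%u: %s (continuing)\n",
                    (unsigned)i, strerror(errno));
            continue;
        }

        if (n < 0)
            perror("write");
        else
            fprintf(stderr, "write #%u: short write\n", (unsigned)i);
        return -1;
    }
    return 0;
}
```

Called on a connected TCP socket, a nonzero *ehostdown_seen together with a complete sequence of integers on the server side would match the behavior described: the error was reported, yet nothing was lost and the connection stayed usable.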
The simplest way to see the problem is with SSH. Machine A is a FreeBSD
box, and machine B is another box on the same switch.
(1) ssh from A to B
(2) see on A that "arp -a" shows the entry for B
(3) on A do "arp -d B"
(4) pull network cable
(5) type <return> to try to send data over the SSH session (of course
nothing will happen, the network cable is still out)
(6) after the network cable has been unplugged for about 8 seconds, plug
it back in
(7) type in the SSH session again
You should see something like "write failed: host is down" and the
session will terminate. Of course, when ssh exits, the TCP connection
closes. The only way to see that it's still open and active is by
writing (or using) an application that ignores EHOSTDOWN errors from
write(). I think some scripting languages do not generate an exception
in that case.
This is very strange behavior and it's causing all kinds of problems on
our network. Does anyone have an explanation for this? Why would a TCP
operation return an error without closing the connection and send the
data anyway? This has existed for a long time.
I believe this is related to:
http://www.freebsd.org/cgi/query-pr.cgi?pr=100172
which is related to:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/netinet/if_ether.c?only_with_tag=RELENG_6#rev1.137.2.5
I tried the patch here:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/netinet/if_ether.c?f=h#rev1.158
(rev 1.158)
but I can still generate the error I mentioned.
Also, what's even more strange is that I set arp to be static on the
production machine, and I am still getting EHOSTDOWNs.
Regards,
Jeff Davis
Garrett Wollman
2007-Jan-31 20:40 UTC
send() returns error even though data is sent, TCP connection still alive
In article <1170269163.22436.71.camel@dogma.v10.wvs>,
Jeff Davis <freebsd@j-davis.com> wrote:

>You should see something like "write failed: host is down" and the
>session will terminate. Of course, when ssh exits, the TCP connection
>closes. The only way to see that it's still open and active is by
>writing (or using) an application that ignores EHOSTDOWN errors from
>write().

I agree that it's a bug. The only time write() on a stream socket
should return the asynchronous error[1] is when the connection has been
(or is in the process of being) torn down as a result of a subsequent
timeout. POSIX says "may fail" for these errors in write() and send()
on sockets.

-GAWollman

[1] There are two kinds of error returns in the socket model:
synchronous errors, like synchronous signals, are attributed to the
result of a specific system call, detected prior to syscall return, and
usually represent programming or user error (e.g., attempting to
connect() on an fd that is not a socket). Asynchronous errors are
detected asynchronously, and merely posted to the socket without being
delivered; they may be delivered on the next socket operation. See XSH
2.10.10, "Pending Error".
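
To make the "pending error" idea in the footnote concrete, here is a small sketch of how an application can look at the error currently posted to a socket, using the standard SO_ERROR socket option. The helper name take_pending_error is hypothetical. Note that reading SO_ERROR also clears the pending error, and nothing here is claimed about whether the EHOSTDOWN in this thread would show up this way on FreeBSD 6.1.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: fetch (and thereby clear) the pending
 * asynchronous error on a socket.
 * Returns 0 if no error is pending, an errno value if one is,
 * or -1 if getsockopt() itself fails (errno is set in that case).
 */
int take_pending_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof err;

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;

    assert(len == sizeof err);   /* SO_ERROR is reported as an int */
    return err;
}
```

An application that gets an unexpected error from write() could call this afterwards: a return of 0 means no further error is posted to the socket, although that alone does not prove the connection is still usable.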