I will try to explain this as best I can. I have two computers; one a
Supermicro X10SAE running CentOS 6, the other a very old DOS box.[*] The DOS
box runs a CCD camera, sending images via Ethernet to the X10SAE. Thus, the
X10SAE runs a Python server on port 5700 (a socket which binds to 5700 and
listens, and then accepts a connection from the DOS box; nothing fancy).[**]
The DOS box connects to the server and sends images.

This all works great, except: when the DOS box exits, crashes, or is rebooted,
it fails to shut down the socket properly. Under CentOS 6.5, upon reboot, when
the DOS box would attempt to reconnect, the original accepted server socket
would (after a couple of connection attempts from the DOS box) see a 0-length
recv and close, allowing the server to accept a new connection and resume
receiving images. Under CentOS 6.6, the server never sees the 0-length recv.
The DOS box flails away attempting to reconnect forever, and the server never
seems to get any type of signal that the DOS box is attempting to reconnect.

Possibly relevant facts:
- The DOS box uses the same local port (1025) every time it tries to connect.
  It does not use a random ephemeral port.
- The exact same code was tested on a CentOS 6.5 and 6.6 box, resulting in the
  described behavior. The boxes were identical clones except for the O/S
  upgrade.
- The Python interpreter was not changed during the upgrade, because I run
  this code using my own 2.7.2 install. However, both glibc and the kernel
  were upgraded as part of the O/S upgrade.

My only theory is that this has something to do with non-ephemeral ports and
socket reuse, but I'm not sure what. It is entirely possible that some
low-level socket option default has changed between 6.5 and 6.6, and I
wouldn't know it. It is also possible that I have been relying on unsupported
behavior this whole time, and that the current behavior is actually correct.
Does anyone have any insight they can offer?

[*] Hardware is not an issue; in fact, I have two identical systems, each of
which has one X10SAE and three DOS boxes. But the problem can be boiled down
to a single pair.

[**] I'm actually using an asyncore.dispatcher to do the bind/listen, and then
tossing the accept()ed socket into an asynchat. But I actually went ahead and
put a trap on socket.recv() just to be sure that I'm not swallowing the
0-length recv by accident.

-G.
--
Glenn Eychaner (geychaner at lco.cl)
Telescope Systems Programmer, Las Campanas Observatory
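To make the arrangement in [**] concrete, here is a minimal, hypothetical
sketch of such a listener in Python 2.7. The class names and the lack of
frame parsing are assumptions for illustration; the real server obviously does
more.

    # Hypothetical sketch (not the original code) of the bind/listen/accept
    # arrangement: an asyncore.dispatcher listening on 5700 that hands each
    # accepted socket to an asynchat.async_chat handler.
    import asyncore
    import asynchat
    import socket

    class ImageHandler(asynchat.async_chat):
        def __init__(self, sock):
            asynchat.async_chat.__init__(self, sock)
            self.set_terminator(None)   # raw stream; real framing omitted

        def collect_incoming_data(self, data):
            # Image bytes arrive here; a 0-length recv instead triggers
            # handle_close() on this channel.
            pass

        def handle_close(self):
            print 'client went away'
            self.close()

    class ImageServer(asyncore.dispatcher):
        def __init__(self, port=5700):
            asyncore.dispatcher.__init__(self)
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.set_reuse_addr()
            self.bind(('', port))
            self.listen(1)

        def handle_accept(self):
            pair = self.accept()
            if pair is not None:
                sock, addr = pair
                ImageHandler(sock)

    if __name__ == '__main__':
        ImageServer()
        asyncore.loop()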
On Thu, Jan 15, 2015 at 03:40:08PM -0300, Glenn Eychaner wrote:

> My only theory is that this has something to do with non-ephemeral ports and
> socket reuse, but I'm not sure what.

If you want a quick detection that the link is dead, have the server
occasionally send bytes to the dos box. You will get an immediate error if the
dos box is up and knows that connection is kaput.

Given that the port numbers of the new connection are the same, I'm kind of
surprised that the behavior changed from 6.5 to 6.6, but, I always use
defensive programming (sending those extra bytes).

-- greg
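A hypothetical sketch of that defensive probe on the server side. The probe
byte and error handling are assumptions; the real protocol would need a byte
the DOS box can tolerate, and (as raised later in the thread) this only helps
if the peer's stack answers with a RST or the client eventually reads.

    # Periodic application-level probe: if the DOS box has rebooted and
    # answers the probe segment with a RST, the send()/next recv() fails,
    # so the server learns the old connection is dead.
    import socket
    import errno

    def probe(conn):
        """Return True if the connection still looks alive, False if dead."""
        try:
            conn.send('\x00')   # one harmless byte (assumed acceptable)
            return True
        except socket.error, e:
            if e.errno in (errno.EPIPE, errno.ECONNRESET):
                return False
            raise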
Since you always use the same local port - maybe you need to set the
SO_REUSEADDR option.

Greetings from Germany
Alex
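For concreteness, setting that option on a listening socket in Python looks
like the sketch below. As discussed later in the thread, this mainly lets a
server rebind an address that is still in TIME_WAIT; it does not by itself
make the server accept a new connection that reuses an old 5-tuple.

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', 5700))
    s.listen(1)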
[I wish I knew how to get the mailing list to thread my replies properly in
the archives; I subscribe to the daily digest, and replying to that doesn't
do it.]

Greg Lindahl wrote:

> On Thu, Jan 15, 2015 at 03:40:08PM -0300, Glenn Eychaner wrote:
>
> > My only theory is that this has something to do with non-ephemeral ports and
> > socket reuse, but I'm not sure what.
>
> If you want a quick detection that the link is dead, have the server
> occasionally send bytes to the dos box. You will get an immediate
> error if the dos box is up and knows that connection is kaput.

What if I am sending bytes to the DOS box, but it never reads the socket?
(Let us assume, for the sake of argument, that I can't change the DOS box
software. In fact, I can, but it's more difficult than changing the Linux
end.) Won't that either result in my detecting the socket as "dead" when it
is not, or eventually overflowing the socket buffering?

> Given that the port numbers of the new connection are the same, I'm
> kind of surprised that the behavior changed from 6.5 to 6.6, but, I
> always use defensive programming (sending those extra bytes).

I was super-surprised by the change, in that I fully tested the upgrade on my
simulator system before deploying, and still got bit on deployment. Of course,
the simulator doesn't have a real DOS box, just a simulation process that
sends the images.

[And, I also recently got bit by this
http://www.macstadium.com/blog/osx-10-9-mavericks-bugs/ after upgrading some
Macs. Sigh, network issues.]

Alex from Germany wrote:

> Since you always use the same local port -
> maybe you need to set SO_REUSEADDR option.

I assume I would have to set that on the client (DOS) side (the box which is
using the same local port 1025 each time); setting it on the bound-listener
socket on the Linux side doesn't seem like it would do anything to resolve the
issue, based on my reading of SO_REUSEADDR on the net:
http://www.unixguide.net/network/socketfaq/4.5.shtml
http://stackoverflow.com/questions/14388706/

-G.
--
Glenn Eychaner (geychaner at lco.cl)
Telescope Systems Programmer, Las Campanas Observatory
What about SO_LINGER at the Linux side, have you tried that?
http://stackoverflow.com/questions/3757289/tcp-option-so-linger-zero-when-its-required

On Fri, Jan 16, 2015 at 1:18 PM, Glenn Eychaner <geychaner at mac.com> wrote:

>> Since you always use the same local port -
>> maybe you need to set SO_REUSEADDR option.
>
> I assume I would have to set that on the client (DOS) side (the box which is
> using the same local port 1025 each time); setting it on the bound-listener
> socket on the Linux side doesn't seem like it would do anything to resolve
> the issue, based on my reading of SO_REUSEADDR on the net:
> http://www.unixguide.net/network/socketfaq/4.5.shtml
> http://stackoverflow.com/questions/14388706/
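For reference, setting SO_LINGER with a zero timeout in Python looks like the
snippet below; closing a socket configured this way sends a RST instead of
going through the normal FIN/TIME_WAIT shutdown. Whether that actually helps
in this particular scenario is questioned later in the thread.

    import socket
    import struct

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # struct linger { int l_onoff; int l_linger; }: on, linger 0 seconds
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))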
On Fri, Jan 16, 2015 at 6:18 AM, Glenn Eychaner <geychaner at mac.com> wrote:

> I was super-surprised by the change, in that I fully tested the upgrade on
> my simulator system before deploying, and still got bit on deployment.

I'm not sure I completely understand the scenario, but it seems wrong for it
to have worked before. Why should a 'new' attempt at a connection with
different TCP sequence numbers have been able to have any effect on the
working socket that hasn't been closed yet at the other end, unless it is
sending a RST packet? Might be interesting to watch with Wireshark to see if
you are getting a RST that doesn't close the old connection.

--
Les Mikesell
lesmikesell at gmail.com
On Jan 15, 2015, at 11:40 AM, Glenn Eychaner <geychaner at mac.com> wrote:

> When the DOS box exits, crashes, or is rebooted, it fails to shut down the
> socket properly.

Yes, that's what happens when you use an OS that doesn't implement sockets in
kernel space: there is no program still running that can send the RST packet
for the dead socket.

> Under CentOS 6.5, upon reboot, when the DOS box would attempt
> to reconnect, the original accepted server socket would (after a couple of
> connection attempts from the DOS box) see a 0-length recv and close, allowing
> the server to accept a new connection and resume receiving images.

You're relying on undocumented behavior here. I don't know exactly what was
going on before [*] but the new behavior is at least legal, and probably
better. It is preventing a bogus reconnection, which could be used for
nefarious purposes. (Connection hijacking, etc.)

[*] Your "flailing about" diagnosis is somewhat lacking in its level of rigor.
:) I think if you look more deeply into it, you'll be shocked at how thin the
ice you've been dancing on is.

> Possibly relevant facts:

Oh, yeah. Relevant like rashes are to a diagnosis of chicken pox.

> - The DOS box uses the same local port (1025) every time it tries to connect.

That's legal only if you allow the previous connection to die first, via the
TIME_WAIT delay. Until that delay expires, the connection's 5-tuple [**] is
still in use, and the kernel is right to refuse to accept another SYN using
the same 5-tuple.

Another poster recommended SO_REUSEADDR, but that's just a hack around the
TIME_WAIT delay. The correct fix is to change the DOS app to use an ephemeral
port number. That won't 100% fix it, because you'll still have a 1:16,383
chance [***] of causing the same problem as you've run into now, but that
sounds live-able to me. If you reboot only once a week, you'd have to be Yoda
to have much reason to be worried about running into this again during the
balance of your tenure with this company.

If you're really worried about it, write the prior port to a text file on
program startup and avoid that one on the next run. Oh, let me guess the
objection: old binary-only DOS app, no source code available, programmers long
since vanished, right?

[**] Transport protocol, local port, local IP, remote port, remote IP. At
least one must be different for a new connection to be allowed.

[***] The IANA ephemeral port range
(https://en.wikipedia.org/wiki/Ephemeral_port) has about 16k ports. I spent
some time puzzling over the probabilities, and I'm pretty sure you don't count
two "draws" here: you're only concerned with the chance that the *next* port
you pick will be equal to the preceding one.
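To make the fixed-port versus ephemeral-port distinction concrete, here is a
hypothetical Python sketch (for illustration only; the real client is a DOS
program, and the addresses are placeholders). Binding the client socket to a
fixed local port reproduces the problem, because every reconnect reuses the
same 5-tuple; skipping the bind() lets the stack pick a fresh ephemeral port,
so a reconnect after a crash looks like a brand-new connection.

    import socket

    SERVER = ('192.0.2.10', 5700)   # placeholder server address

    # What the DOS box effectively does today: same local port every time.
    fixed = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    fixed.bind(('', 1025))
    fixed.connect(SERVER)

    # What a well-behaved client does: no explicit bind, ephemeral local port.
    ephemeral = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    ephemeral.connect(SERVER)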
A couple more thoughts...

On Jan 16, 2015, at 10:42 AM, Warren Young <wyml at etr-usa.com> wrote:

> On Jan 15, 2015, at 11:40 AM, Glenn Eychaner <geychaner at mac.com> wrote:
>
>> When the DOS box exits, crashes, or is rebooted, it fails to shut down the
>> socket properly.
>
> Yes, that's what happens when you use an OS that doesn't implement sockets
> in kernel space: there is no program still running that can send the RST
> packet for the dead socket.

That said, your Linux/Python side code shouldn't be relying on the RST anyway.
A power blip that unceremoniously reboots the DOS box will also skip the RST.
That happens with *all* TCP stacks, even in-kernel ones.

True war story, seen on devices from multiple vendors:

The setup: An embedded system has a TCP listener. Some network problem [*]
causes packet loss for an extended period, causing an established peer to time
out and drop its conn. The packet loss also prevents the RST/FIN from getting
to the embedded device, so it thinks it's still connected. Because the
embedded device's programmer is counting every processor cycle, he makes it so
it only handles a single TCP connection at a time.

The result: The embedded box is now unreachable until boots on the ground walk
over and power-cycle it.

The fix: Make the embedded TCP listener either a) allow multiple TCP
connections; or b) drop the prior TCP conn when a new one comes in.

The lesson: If your TCP/IP program was easy to write, it isn't robust. You've
missed *something*.

[*] It could be a misconfiguration, broken cable, firmware update,
power-cycled wiring closet, etc.

> The correct fix is to change the DOS app to use an ephemeral port number.

That also fixes the "missing RST" problem I've described above. If by some bad
bit of luck the DOS box happens to pick the same ephemeral port number after a
reboot that it was using before, it will get RST. The DOS app will then retry,
causing the DOS TCP stack to pick a different ephemeral port, so it will
succeed.

A different fix is to exploit the real-time nature of video camera imagery: if
your Python app goes more than a second without receiving an image frame, it
can presume that the DOS box has disappeared again, and drop its conn. By the
time the DOS box reboots, TIME_WAIT may have expired, so the DOS box might
reconnect without a problem. You may wish to reduce tcp_fin_timeout to ensure
that TIME_WAIT does indeed expire before the DOS box reboots, per
http://goo.gl/zQCzqK
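A hypothetical sketch of that inactivity-timeout idea: remember when data last
arrived and drop the connection if the camera has been silent too long. The
one-second figure comes from the message above; the handler name and polling
interval are assumptions, and real frame parsing is omitted.

    import asynchat
    import asyncore
    import time

    IDLE_LIMIT = 1.0   # seconds of silence before assuming the DOS box is gone

    class TimedImageHandler(asynchat.async_chat):
        def __init__(self, sock):
            asynchat.async_chat.__init__(self, sock)
            self.set_terminator(None)
            self.last_rx = time.time()

        def collect_incoming_data(self, data):
            self.last_rx = time.time()   # any incoming bytes count as "alive"

        def idle_check(self):
            if time.time() - self.last_rx > IDLE_LIMIT:
                self.close()             # frees the stale connection on our side

    # Main loop idea: poll in short steps so the idle check runs regularly.
    def run():
        while asyncore.socket_map:
            asyncore.loop(timeout=0.25, count=1)
            for chan in asyncore.socket_map.values():
                if isinstance(chan, TimedImageHandler):
                    chan.idle_check()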
I'd like to thank everyone for their replies and advice. I'm sorry it took so
long for me to respond; I took a long weekend after a long shift. Some
remaining questions can be found in the final section of this posting.

The summary (I hope I have all of this correct):

Problem: A DOS box (client) connects to a Linux box (server) using the same
local port (1025) on the client each time. The client sends data which the
server reads; the server is passive and does not write any data. If the client
crashes and fails to properly close the connection, under CentOS 6.5 the old
accepted socket on the server receives a 0-length recv(), allowing for a
"clean" reconnect; under 6.6 it does not, and the client unsuccessfully
retries the reconnect endlessly.

Diagnosis: Because the client is connecting using the same port every time,
the server sees the same 5-tuple each time. At that point, the reconnection
should fail until the old socket on the server is closed, and the previous
behavior of receiving a 0-length recv() on the old server socket is
unsupported and unreliable. Until the update to CentOS 6.6 'broke' the
existing functionality, I had never looked deeply into the connection between
the client and the server; it 'just worked', so I left it alone. Once it did
break, I realized that because the client was connecting on the same port
every time, the whole setup might have been relying on unsupported behavior.

My workaround: I unfortunately had to implement an emergency workaround before
receiving any replies. Fortunately, the client also sends status messages to
the same computer (but a different server program) over a serial-port
side-channel (well, it's more complicated than that, but anyway). I set up a
listener for a "failed connection" status message which signal()s the server
program to close all client connections (but not the bound dispatchers) and
thereby force all clients to reconnect. It's a cheat and a cheesy hack, but it
works.

Other diagnostics: One test I intend to run in a couple of weeks (next
opportunity) is to boot the CentOS 6.6 box with the older kernel, in order to
find out whether the behavior change is in the kernel or in the libraries.

Correct solutions:
1) Client port: The client should be connecting on a random, ephemeral port
   like a good client instead of on a fixed port, which I suspected. I don't
   know if this can be changed (due to a really dumb binary TCP driver).
2) Protocol change: The server never writes to the socket in the existing
   protocol, and can therefore never find out that the connection is dead.
   Writing to the socket would reveal this. But what happens if the server
   writes to the socket, and the client never reads? (We do, as it happens,
   have access to the client software, so the protocol can be fixed
   eventually. But I'm still curious as to the answer.)
3) Several people suggested using SO_REUSEADDR and/or an SO_LINGER of zero to
   drop the socket out of TIME_WAIT, but does the socket enter TIME_WAIT as
   soon as the client crashes? I didn't think so, but I may be wrong.
4) Several people suggested SO_KEEPALIVE, but keepalive probes occur only
   after hours unless you change kernel parameters via procfs and/or sysctl,
   and when the client crashes, I need recovery right away, not hours down the
   road. Time here is literally worth a dollar per second, roughly. (See the
   keepalive sketch after this message.)

Anyway, thanks for the discussion and helpful links. At one time I knew all
this stuff, but it has been 20 years since I had to dig into the TCP protocol
this deeply.

-G.
--
Glenn Eychaner (geychaner at lco.cl)
Telescope Systems Programmer, Las Campanas Observatory
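Regarding point 4 above: on Linux, the keepalive timings can also be tuned per
socket, without touching the procfs/sysctl defaults, so detection can be
brought down from hours to seconds. A hypothetical sketch, assuming Linux and
Python 2.7 (the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT constants are
Linux-specific, and the values are only illustrative):

    import socket

    def enable_fast_keepalive(sock, idle=5, interval=2, count=3):
        """Probe after `idle` seconds of silence, every `interval` seconds,
        and drop the connection after `count` unanswered probes."""
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

Note that after the DOS box reboots, its stack should answer a keepalive probe
for the forgotten connection with a RST, which would also free the stale
server-side socket.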
On Wed, Jan 21, 2015 at 10:49 AM, Glenn Eychaner <geychaner at mac.com> wrote:

> 2) Protocol change: The server never writes to the socket in the existing
> protocol, and can therefore never find out that the connection is dead.
> Writing to the socket would reveal this. But what happens if the server writes
> to the socket, and the client never reads? (We do, as it happens, have access
> to the client software, so the protocol can be fixed eventually. But I'm still
> curious as to the answer.)

If you can change the client, and you want to keep essentially re-using the
same socket after a reboot, can't you simply send a RST on it when starting up
and then re-connect - or even run a different program ahead of starting it
that just sends a RST with that source/dest/port combination? That should make
the server side abandon that connection and accept another, although you may
still need to play tricks on the server side to avoid TIME_WAIT.

--
Les Mikesell
lesmikesell at gmail.com
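For illustration only, a hypothetical way to emit such a RST from a separate
program is with a packet-crafting library such as scapy (an assumption; the
DOS box itself obviously cannot run this, and the addresses are placeholders).
Note that a blind RST whose sequence number does not land in the server's
receive window may simply be discarded, so in practice the sequence number
would have to be learned, e.g. from a packet capture of the old connection.

    from scapy.all import IP, TCP, send

    # seq=0 is a stand-in; a real RST needs an in-window sequence number.
    send(IP(src='192.0.2.20', dst='192.0.2.10') /
         TCP(sport=1025, dport=5700, flags='R', seq=0))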
On 01/21/2015 08:49 AM, Glenn Eychaner wrote:

> Diagnosis:
> the previous behavior of
> receiving a 0-length recv() on the old server socket is unsupported and
> unreliable.

You mention that a lot, and it might help to understand why that happens. A
0-length recv() on a standard (blocking) socket indicates end-of-file. The
remote side has closed the connection. What you were previously seeing was the
client sending SYN to establish a new connection. Because it was unrelated to
the existing connection on the same 5-tuple, the server's TCP stack closed the
existing socket. I'm not positive, but the server may have sent a keepalive or
other probe to the client and got a RST. Either way, the kernel determined
that the socket had been closed by the client, and a 0-length read (recv) is
the way that the kernel informs an application of that closure.

> Until the update to CentOS 6.6 'broke' the existing functionality,
> I had never looked deeply into the connection between the client and the
> server; it 'just worked', so I left it alone. Once it did break, I realized
> that because the client was connecting on the same port every time, the
> whole setup might have been relying on unsupported behavior.

Not just unsupported, but incorrect. Unrelated packets with a 5-tuple matching
an established socket are typically injection attacks. TCP is supposed to
discard them.

> Other diagnostics:
> One test I intend to run in a couple of weeks (next opportunity) is to boot
> the CentOS 6.6 box with the older kernel, in order to find out whether the
> behavior change is in the kernel or in the libraries.

It's always good to test, but it's almost certainly the kernel. Libraries
don't decide whether or not a socket has closed, which is what the 0-length
read (recv) indicates.

> Correct solutions:
> 1) Client port: The client should be connecting on a random, ephemeral port

Yes.

> 2) Protocol change: The server never writes to the socket in the existing
> protocol, and can therefore never find out that the connection is dead.
> Writing to the socket would reveal this. But what happens if the server
> writes to the socket, and the client never reads?

You will eventually fill up a buffer on one side or the other, and at that
point any further write (send) will block forever.

> 3) Several people suggested using SO_REUSEADDR and/or an SO_LINGER of zero to
> drop the socket out of TIME_WAIT, but does the socket enter TIME_WAIT as soon
> as the client crashes? I didn't think so, but I may be wrong.

No. It enters TIME_WAIT when the socket closes. If the socket were closing,
you'd be getting a 0-length read (recv). You can confirm that with "netstat".
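As a small plain-socket illustration of the 0-length recv() = end-of-file rule
(hypothetical; the real server uses asyncore, and only the port number is
taken from this thread):

    import socket

    def serve_once(port=5700):
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind(('', port))
        listener.listen(1)
        conn, addr = listener.accept()
        while True:
            data = conn.recv(4096)
            if not data:            # 0-length recv: the peer closed (EOF)
                print 'client closed the connection'
                break
            # process image bytes here ...
        conn.close()

Conversely, if this server wrote to a client that never reads, conn.send()
would eventually block once the send and receive buffers fill, exactly as
described above.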