A couple more thoughts...
On Jan 16, 2015, at 10:42 AM, Warren Young <wyml at etr-usa.com> wrote:
> On Jan 15, 2015, at 11:40 AM, Glenn Eychaner <geychaner at mac.com> wrote:
>
>> When the DOS box exits, crashes, or is rebooted, it fails to shut down the
>> socket properly.
>
> Yes, that's what happens when you use an OS that doesn't implement sockets
> in kernel space: there is no program still running that can send the RST packet
> for the dead socket.

That said, your Linux/Python side code shouldn't be relying on the RST anyway.
A power blip that unceremoniously reboots the DOS box will also skip the RST.
That happens with *all* TCP stacks, even in-kernel ones.
True war story, seen on devices from multiple vendors:
The setup: An embedded system has a TCP listener. Some network problem [*]
causes packet loss for an extended period, causing an established peer to time
out and drop its conn. The packet loss also prevents the RST/FIN from getting
to the embedded device, so it thinks it's still connected. Because the embedded
device's programmer is counting every processor cycle, he makes it so it only
handles a single TCP connection at a time.
The result: The embedded box is now unreachable until boots on the ground walk
over and power-cycle it.
The fix: Make the embedded TCP listener either a) allow multiple TCP
connections; or b) drop the prior TCP conn when a new one comes in.
The lesson: If your TCP/IP program was easy to write, it isn't robust. You've
missed *something*.
[*] It could be a misconfiguration, broken cable, firmware update, power-cycled
wiring closet, etc.
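
Option (b) from the fix above could look roughly like this on a host with a normal sockets API. This is only a sketch: the class name SingleClientListener and its methods are made up for illustration, and a real embedded stack will have its own API. The point is that the listener is always watched alongside the current peer, so a new arrival can evict a peer that may be long dead.

```python
import select
import socket


class SingleClientListener:
    """Hypothetical helper: holds at most one client; a newcomer replaces it."""

    def __init__(self, host="127.0.0.1", port=0):
        self.server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.server.bind((host, port))
        self.server.listen(1)
        self.current = None  # the one connection being serviced

    def accept_replacing(self):
        """Accept a new peer and drop whichever one was held before."""
        conn, addr = self.server.accept()
        if self.current is not None:
            # The old peer may have vanished without a FIN or RST ever
            # reaching us, so close our end instead of waiting for it.
            self.current.close()
        self.current = conn
        return conn, addr

    def poll(self, timeout=1.0):
        """Wait for activity; a new connection takes priority over old data."""
        watched = [self.server]
        if self.current is not None:
            watched.append(self.current)
        readable, _, _ = select.select(watched, [], [], timeout)
        if self.server in readable:
            self.accept_replacing()
            return None
        if self.current is not None and self.current in readable:
            try:
                data = self.current.recv(4096)
            except OSError:
                data = b""  # treat a reset like a close
            if not data:
                self.current.close()
                self.current = None
            return data
        return None
```

It still handles one connection at a time, so it costs little more than the single-connection version, but it can no longer be wedged by a peer it merely believes is connected.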
> The correct fix is to change the DOS app to use an ephemeral port number.
That also fixes the "missing RST" problem I've described above. If by some bad
bit of luck the DOS box happens to pick the same ephemeral port number after a
reboot that it was using before, it will get RST. The DOS app will then retry,
causing the DOS TCP stack to pick a different ephemeral port, so it will
succeed.
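
For illustration only, here is the client-side pattern in Python. The DOS app will be using some other TCP library, so treat this as a sketch of the idea rather than code for that box; connect_with_retry and its parameters are names I made up. The key detail is what is missing: there is no bind() to a fixed local port before connect(), so the stack picks an ephemeral source port, and each retry uses a fresh socket.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=0.5):
    """Connect from an ephemeral local port, retrying if the attempt fails."""
    last_error = None
    for _ in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # No bind() here: the TCP stack assigns the source port itself.
            sock.connect((host, port))
            return sock
        except OSError as exc:
            # Covers a refused or reset attempt; try again with a new
            # socket, which gives the stack a chance to pick another port.
            last_error = exc
            sock.close()
            time.sleep(delay)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

Calling sock.getsockname() on the result shows which local port the stack chose.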
A different fix is to exploit the real-time nature of video camera imagery: if
your Python app goes more than a second without receiving an image frame, it can
presume that the DOS box has disappeared again, and drop its conn. By the time
the DOS box reboots, TIME_WAIT may have expired, so the DOS box might reconnect
without a problem.
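
A minimal sketch of that receive-side timeout, assuming the Python app reads from an already-accepted socket. I don't know how the frames are delimited on the wire, so this just hands each received chunk to a caller-supplied handle_frame callback (a hypothetical name, as is read_frames); the one-second default comes from the paragraph above.

```python
import socket


def read_frames(conn, handle_frame, timeout=1.0, chunk_size=65536):
    """Pass received data to handle_frame until the peer goes quiet or closes.

    Returns "timeout", "closed", or "error" to say why reading stopped.
    The connection is always closed before returning.
    """
    conn.settimeout(timeout)
    try:
        while True:
            try:
                data = conn.recv(chunk_size)
            except socket.timeout:
                # Nothing arrived within the window: presume the sender
                # has disappeared rather than wait for a FIN or RST.
                return "timeout"
            except OSError:
                return "error"
            if not data:
                return "closed"  # orderly shutdown by the peer
            handle_frame(data)
    finally:
        conn.close()
```

After it returns, the app can go straight back to accept() and wait for the DOS box to reconnect.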
You may wish to reduce tcp_fin_timeout to ensure that TIME_WAIT does indeed
expire before the DOS box reboots, per http://goo.gl/zQCzqK