Very rarely, but repeatedly, we see the OpenSSH client hang. On the server side there is no indication of a connection in the logs. When we find it, it is always a scripted remote command with no user interaction. This seems to happen only in VM environments, but I could be wrong.

It seems surprising to me that there would not be timeouts and retries in the protocol. Is this expected behavior in some configuration? If not, what should I try to gather when it happens again? Or maybe there is some setting to make the connection reliable.

We have seen this maybe 2 or 3 times over a couple of years, so it is not frequent. It happened yesterday during a complex distributed installation process between RHEL 7 VMs on the same data center LAN.

Any advice appreciated.

steve
On Sun, 10 Nov 2019 at 05:10, Steve McAfee <smcafee.social at gmail.com> wrote:
> Very rarely, but it has repeated, we see openssh on the client side
> hanging. On the server side there is no indication of connection in the
> logs. These are always scripted remote commands that do not have user
> interaction when we find it. This seems to be happening only in vm
> environments but I could be wrong.

What's the VM platform and underlying network technology? At least one (VMware Fusion) is known to have problems, although yours doesn't sound exactly like this:
https://marc.info/?l=openssh-unix-dev&m=153535111501535&w=2

> It seems surprising to me that there
> would not be timeouts and retries on the protocol,

SSH is built on top of TCP, which provides the reliable bytestream and thus implements the timeouts and retries, so if you can find the problematic connection in the output of netstat you may get some clues about what's going on.

One of the failure modes that can behave as you describe is the infamous TCP MTU blackhole, wherein a large packet gets fragmented, the 2nd fragment gets dropped for some reason and the IP packet times out during reassembly. TCP retransmits the packet, which again gets fragmented, and the cycle repeats until TCP eventually times out the connection. PPPoE and 802.1Q VLANs are common culprits because they reduce the MTUs just a little bit.

I'd suggest checking:
 - netstat for the failing connections, looking for increasing Send-Q values,
 - netstat -s on problematic machines, looking for atypical counter values,
 - MTUs on the hosts and everything in between them.

If it's none of these things then it's probably time to break out tcpdump.

> Or maybe there is some setting to make the connection reliable.

The ServerAliveInterval and ServerAliveCountMax settings can detect the class of failure I described above, but in those cases the root cause is a broken network and the network is what needs to be fixed.
-- 
Darren Tucker (dtucker at dtucker.net)
GPG key 11EAA6FA / A86E 3E07 5B19 5880 E860 37F4 9357 ECEF 11EA A6FA (new)
    Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.
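
A rough sketch of the MTU check suggested above, in Python. It assumes Linux iputils ping, where -M do sets the Don't Fragment bit and -s sets the ICMP payload size. The helper names are made up for illustration and are not part of any OpenSSH tooling. The arithmetic is just the MTU minus the 20-byte IPv4 header (no options) and the 8-byte ICMP echo header.

```python
import subprocess

IPV4_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8    # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError("MTU too small: %d" % mtu)
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_df_command(host, mtu, count=3):
    """Build a Linux iputils ping command that forbids fragmentation."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(icmp_payload_for_mtu(mtu)), host]


def path_passes_mtu(host, mtu):
    """True if at least one full-size, DF-marked ping to host is answered."""
    result = subprocess.run(ping_df_command(host, mtu),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

For a 1500-byte Ethernet MTU this sends 1472-byte payloads. If those fail between two hosts while the 1464-byte payloads for a PPPoE-style 1492 MTU get through, something on the path probably has a smaller MTU than the hosts assume.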
Hi,

On Sun, Nov 10, 2019 at 06:58:47PM +1100, Darren Tucker wrote:
> One of the failure modes that can behave as you describe is the infamous TCP MTU
> blackhole, wherein a large packet gets fragmented, the 2nd fragment
> gets dropped for
> some reason and the IP packet times out during reassembly.

I've run into mobile networks recently that drop packets if you change the QoS flags. So SSH negotiation works fine, afterwards the client changes the QoS bits to "interactive", and that seems to confuse their NAT gateway...

"ssh $machine $command" worked, so I changed my .ssh/config to

host $myjumphost
    # gert, 19.10.19, "like a non-interactive session" - DTAG is acting up right now
    ipqos cs1

... and it went back to working.

Might or might not be the case here.

gert
-- 
"If was one thing all people took for granted, was conviction that if you
 feed honest figures into a computer, honest figures come out. Never doubted
 it myself till I met a computer with a sense of humor."
                             Robert A. Heinlein, The Moon is a Harsh Mistress

Gert Doering - Munich, Germany                             gert at greenie.muc.de
ISTR that I've read about some bugs with packet traversal through VM NAT setups. It was something about the RST packet triggering NAT destruction but not always being relayed further (some race condition?!), which sounds as if it fits the bill here.

https://forums.virtualbox.org/viewtopic.php?f=1&t=20579 comes close but is a bit old...

You could try to use the TCP keepalive settings in SSH, and/or to reduce the MTU size (as mentioned in another mail here already).

Else a few more setup details might help...
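
One way to apply the keepalive advice to scripted commands, sketched in Python. BatchMode, ConnectTimeout, ServerAliveInterval and ServerAliveCountMax are real ssh_config options; the function names, the default values and the extra hard timeout are assumptions chosen for illustration. Note that the ServerAlive probes are SSH-level messages and are separate from the TCPKeepAlive option, which is on by default.

```python
import subprocess


def build_ssh_command(host, remote_command, alive_interval=15,
                      alive_count_max=3, connect_timeout=10):
    """Build an ssh argv for an unattended command with keepalives enabled."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                       # never prompt in a script
        "-o", "ConnectTimeout=%d" % connect_timeout,
        "-o", "ServerAliveInterval=%d" % alive_interval,
        "-o", "ServerAliveCountMax=%d" % alive_count_max,
        host,
        remote_command,
    ]


def run_remote(host, remote_command, hard_timeout=None, **ssh_options):
    """Run the command; raises subprocess.TimeoutExpired if hard_timeout is hit."""
    argv = build_ssh_command(host, remote_command, **ssh_options)
    return subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          timeout=hard_timeout)
```

With an interval of 15 seconds and a count of 3, ssh should give up on an unresponsive server after roughly 45 seconds instead of hanging indefinitely. The hard timeout is only a backstop for the script; as noted earlier in the thread, none of this fixes a broken network.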
Thanks everyone for the feedback on the OpenSSH hang. I'm going to ask the customer to review the MTU in their configuration first to see if they can find a problem. Also, if their host OS is Windows there are some suggested things to check, but I don't think it is Windows. If it ever happens again I'll try to investigate more, as suggested by Darren, before we interrupt it.

steve

On Sun, Nov 10, 2019 at 5:29 AM Philipp Marek <philipp at marek.priv.at> wrote:
> ISTR that I've read about some bugs with packet traversal through VM NAT
> setups.
> It was something about the RST packet triggering NAT destruction but not
> being relayed further all the time (some race condition?!), that sounds
> as if it fit the bill here.
>
> https://forums.virtualbox.org/viewtopic.php?f=1&t=20579 comes close but
> is a bit old...
>
> You could try to use the TCP keepalive settings in SSH, and/or to reduce
> the MTU size (as mentioned in another mail here already).
>
> Else a few more setup details might help...
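
For gathering evidence before a hung session is interrupted, the netstat check Darren describes could be scripted roughly as follows. This is a sketch in Python that assumes the column layout of Linux `netstat -tn` (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name is hypothetical.

```python
def stuck_connections(netstat_output, port=22):
    """Return (local, remote, send_q, state) for TCP rows on `port` with unsent data."""
    suffix = ":%d" % port
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row layout: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if not fields[2].isdigit():
            continue
        send_q = int(fields[2])
        local, remote, state = fields[3], fields[4], fields[5]
        if send_q > 0 and (local.endswith(suffix) or remote.endswith(suffix)):
            rows.append((local, remote, send_q, state))
    return rows
```

Feeding it the output of `netstat -tn` taken twice, some seconds apart, shows whether Send-Q on the SSH connection stays non-zero or keeps growing, which would suggest that data is being sent but never acknowledged.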