Very rarely, but it has happened more than once, we see OpenSSH hanging on the client side. On the server side there is no indication of the connection in the logs. These are always scripted remote commands with no user interaction when we find them. It seems to be happening only in VM environments, but I could be wrong.

It surprises me that there would be no timeouts and retries in the protocol. Is this expected behavior under some configuration, and if not, what should I try to gather when it happens again? Or is there some setting to make the connection reliable?

We have seen this maybe 2 or 3 times over a couple of years, so it is not frequent. It happened yesterday during a complex distributed installation process between RHEL 7 VMs on the same data center LAN. Any advice appreciated.

steve
On Sun, 10 Nov 2019 at 05:10, Steve McAfee <smcafee.social at gmail.com> wrote:
> Very rarely, but it has repeated, we see openssh on the client side
> hanging. On the server side there is no indication of connection in the
> logs. These are always scripted remote commands that do not have user
> interaction when we find it. This seems to be happening only in vm
> environments but I could be wrong.

What's the VM platform and underlying network technology?  At least one
(VMware Fusion) is known to have problems, although yours doesn't sound
exactly like this:
https://marc.info/?l=openssh-unix-dev&m=153535111501535&w=2

> It seems surprising to me that there
> would not be timeouts and retries on the protocol,

SSH is built on top of TCP, which provides the reliable bytestream and
thus implements the timeouts and retries, so if you can find the
problematic connection in the output of netstat you may get some clues
about what's going on.

One of the failure modes that can behave as you describe is the infamous
TCP MTU blackhole, wherein a large packet gets fragmented, the 2nd
fragment gets dropped for some reason and the IP packet times out during
reassembly.  TCP retransmits the packet, which again gets fragmented and
the cycle repeats until TCP eventually times out the connection.  PPPoE
and 802.1Q VLANs are common culprits because they reduce the MTU just a
little bit.

I'd suggest checking:
 - netstat for the failing connections, looking for increasing SendQ values,
 - netstat -s on the problematic machines, looking for atypical counter values,
 - MTUs on the hosts and everything in between them.

If it's none of these things then it's probably time to break out tcpdump.

> Or maybe there is some setting to make the connection reliable.

The ServerAliveInterval and ServerAliveCountMax settings can detect the
class of failure I described above, but in those cases the root cause is
a broken network, and the network is what needs to be fixed.

-- 
Darren Tucker (dtucker at dtucker.net)
GPG key 11EAA6FA / A86E 3E07 5B19 5880 E860  37F4 9357 ECEF 11EA A6FA (new)
    Good judgement comes with experience. Unfortunately, the experience
    usually comes from bad judgement.
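For reference, a rough sketch of the checks Darren describes, assuming
Linux hosts on both ends; the interface name (eth0) and the server name
are only placeholders, and on RHEL 7 the netstat commands come from the
net-tools package:

    # find the hung ssh connection and watch whether its Send-Q keeps growing
    netstat -tanp | grep ssh

    # protocol counters; unusually high retransmit or reassembly-failure
    # numbers point at packet loss or an MTU problem
    netstat -s | less

    # interface MTU on each host
    ip link show eth0

    # probe the path to the peer with a full-size, don't-fragment packet
    # (1472 bytes of payload + 28 bytes of headers = 1500); if this fails
    # while smaller sizes work, something on the path has a smaller MTU
    # or is dropping fragments
    ping -M do -s 1472 <server>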
Hi,

On Sun, Nov 10, 2019 at 06:58:47PM +1100, Darren Tucker wrote:
> One of the failure modes that can behave as you describe is the infamous
> TCP MTU blackhole, wherein a large packet gets fragmented, the 2nd
> fragment gets dropped for some reason and the IP packet times out during
> reassembly.

I've run into mobile networks recently that drop packets if you change the
QoS flags.  So the SSH negotiation works fine, then the client changes the
QoS bits to "interactive", and that seems to confuse their NAT gateway...

"ssh $machine $command" worked, so I changed my .ssh/config to

  host $myjumphost
        # gert, 19.10.19, "like a non-interactive session" - DTAG is acting up right now
        ipqos cs1

... and it went back to working.

Might or might not be the case here.

gert

-- 
"If was one thing all people took for granted, was conviction that if you
 feed honest figures into a computer, honest figures come out. Never doubted
 it myself till I met a computer with a sense of humor."
                             Robert A. Heinlein, The Moon is a Harsh Mistress

Gert Doering - Munich, Germany                             gert at greenie.muc.de
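If you want to test whether that DSCP change is what upsets a middlebox,
the same option can also be given on the command line for a single
connection; the host name and interface below are just placeholders:

    # force the "bulk" QoS class for the whole session
    ssh -o IPQoS=cs1 user@somehost 'uname -a'

    # confirm what actually goes on the wire; the DSCP value shows up in
    # the "tos" field of tcpdump's verbose output
    tcpdump -v -n -i eth0 'tcp port 22'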
ISTR that I've read about some bugs with packet traversal through VM NAT
setups.  It was something about the RST packet triggering NAT destruction
but not being relayed further all the time (some race condition?!); that
sounds as if it fits the bill here.

https://forums.virtualbox.org/viewtopic.php?f=1&t=20579 comes close but is
a bit old...

You could try to use the TCP keepalive settings in SSH, and/or reduce the
MTU size (as mentioned in another mail here already).

Else a few more setup details might help...
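A rough sketch of those two workarounds, assuming Linux guests; the
interface name and the numbers are only examples, and a permanent MTU
change would normally go into the distribution's network configuration
rather than an ad-hoc command:

    # ~/.ssh/config: TCP keepalives are on by default but the kernel only
    # probes after about two hours of silence, so the application-level
    # ServerAlive* probes are what will actually abort a dead session quickly
    Host *
        TCPKeepAlive yes
        ServerAliveInterval 15
        ServerAliveCountMax 3

    # temporarily lower the interface MTU to rule out fragmentation problems
    ip link set dev eth0 mtu 1400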
Thanks everyone for the feedback on the OpenSSH hang.  I'm going to ask
the customer to review the MTUs in their configuration first to see if
they can find a problem.  Also, if their host OS is Windows there are some
things suggested to check, but I don't think it is Windows.  If it ever
happens again I'll try to investigate more as suggested by Darren before
we interrupt it.

steve

On Sun, Nov 10, 2019 at 5:29 AM Philipp Marek <philipp at marek.priv.at> wrote:

> ISTR that I've read about some bugs with packet traversal through VM NAT
> setups.
> It was something about the RST packet triggering NAT destruction but not
> being relayed further all the time (some race condition?!), that sounds
> as if it fit the bill here.
>
> https://forums.virtualbox.org/viewtopic.php?f=1&t=20579 comes close but
> is a bit old...
>
> You could try to use the TCP keepalive settings in SSH, and/or to reduce
> the MTU size (as mentioned in another mail here already).
>
> Else a few more setup details might help...