bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-09 04:06 UTC
[Bug 2756] New: sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

            Bug ID: 2756
           Summary: sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
           Product: Portable OpenSSH
           Version: 6.7p1
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: sshd
          Assignee: unassigned-bugs at mindrot.org
          Reporter: willchan at google.com

Hello,

The short summary of my situation is that I have a mobile client that establishes an ssh connection to a server and uses remote port forwarding to expose access to local services. On the server side, a monitoring service (a Prometheus instance we run) polls via the remote port.

When the mobile WAN connection dies, the client attempts to re-establish the ssh connection and the same remote port forwarding. It fails with "error: channel_setup_fwd_listener_tcpip: cannot listen to port:". Our script retries every 15 seconds, but the reconnection keeps failing until approximately 15 minutes later.

I should note at this point that the client is running OpenSSH_7.2p2 and the server is running OpenSSH_6.7p1. Both are running Linux, albeit different distros.

So, we thought we could handle this problem by setting ClientAliveInterval and ClientAliveCountMax in the server's sshd_config. We set ClientAliveInterval to 10 and ClientAliveCountMax to 3, but that does not appear to solve the issue. We dug in further and noted the following:

* It appears that the old sshd process that is listening on the remote port is still alive, which explains the channel_setup_fwd_listener_tcpip error.
* The old sshd process goes away after about 15 minutes.
* The server's tcp_retries2 is set to 15 (the default).
* The monitoring service polls every second.
* The server has many TCP sockets to the remote port forward in CLOSE_WAIT. I presume this is because the monitoring service closes its connection to the remote forwarding channel, but the sshd process doesn't close its end of the connection, since the client hasn't closed the channel.
* When we reduced tcp_retries2 to 8, the time for the sshd process to exit dropped to about 2 minutes.
* We also tried increasing our monitoring polling interval to 1 minute, which seemed to reduce the recovery time to under a minute.

AFAICT, writing to the remote end of the forwarding channel can interfere with ClientAliveInterval. Take this with many buckets of salt, given I have never looked at the code before. I poked into it briefly, and the select() call that uses ClientAliveInterval as its timeout appears to check both read and write file descriptors. I was looking specifically at https://github.com/openssh/openssh-portable/blob/92e9fe633130376a95dd533df6e5e6a578c1e6b8/serverloop.c#L263. IIUC, if something is constantly writing (e.g. our monitoring service) to the remote end of a channel, then client_alive_check() never gets called, even if the connection to the client is dead.

At this point, I figured I'd ask for help. Did I understand the code correctly: is client liveness not checked if the remote end of a forwarding channel receives data to forward on to the client? If not, can anyone help explain the situation we're seeing? Or, if I read the code correctly, is that the desired behavior for ClientAliveInterval? If so, how should I configure sshd to close the session when the client connection is dead, even if the remote end of the forwarding channel is being written to?

Thanks in advance, and apologies if I've missed something obvious or neglected to include important information.

-Will

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
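The roughly 15 minute and 2 minute recovery times appear to line up with how long Linux keeps retransmitting unacknowledged data before it gives up on a connection. That limit is governed by net.ipv4.tcp_retries2, and the kernel's ip-sysctl documentation quotes a hypothetical timeout of 924.6 seconds for the default value of 15.

The C sketch below reproduces that arithmetic under stated assumptions. The retransmission timeout (RTO) is assumed to start at 200 ms, double on each retry, and be capped at 120 s. The function name retries2_timeout_sec is hypothetical and is not a kernel API.

```c
/* Assumed Linux values: the RTO starts at TCP_RTO_MIN (200 ms) and doubles
 * on each retransmission until capped at TCP_RTO_MAX (120 s). Real
 * connections also depend on the measured RTT, so this is an approximation. */
#define RTO_MIN_MS 200.0
#define RTO_MAX_MS 120000.0

/* Hypothetical helper: approximate seconds until the kernel abandons a
 * connection whose data is never acknowledged, for a given tcp_retries2.
 * Sums the timeout of the original transmission plus one per retry. */
static double retries2_timeout_sec(int retries2)
{
    double rto = RTO_MIN_MS;
    double total_ms = 0.0;

    for (int i = 0; i <= retries2; i++) {
        total_ms += rto;
        rto *= 2.0;                 /* exponential backoff */
        if (rto > RTO_MAX_MS)
            rto = RTO_MAX_MS;       /* backoff is capped */
    }
    return total_ms / 1000.0;
}
```

Under these assumptions, retries2_timeout_sec(15) comes to about 924.6 s (roughly 15.4 minutes) and retries2_timeout_sec(8) to about 102.2 s. Both are consistent with the recovery times reported above.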
bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-09 04:37 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

Darren Tucker <dtucker at zip.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dtucker at zip.com.au

--- Comment #1 from Darren Tucker <dtucker at zip.com.au> ---
(In reply to willchan from comment #0)
> I should note at this point that the client is running OpenSSH_7.2p2
> and the server is running OpenSSH_6.7p1.

I'd suggest trying 7.5p1; there was a keepalive bug (#2252) fixed in 7.3.
bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-09 05:54 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

--- Comment #2 from willchan at google.com ---
OK, thanks! I'll try testing it the next time I get my team together to reproduce this situation, and I'll get back to you on that one.

What do you think about my questions regarding the code snippet from wait_until_can_do_something() that I linked earlier?

    /* Wait for something to happen, or the timeout to expire. */
    ret = select((*maxfdp)+1, *readsetp, *writesetp, NULL, tvp);

At a very quick glance, it looks possible that the ClientAliveInterval timeout (tvp) is never hit if the client connection is dead (so readsetp never becomes ready) and writesetp always becomes ready before ClientAliveInterval expires. In my situation, a monitoring service that polls a remote forwarding channel's server port more often than ClientAliveInterval might conceivably trigger this.
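The starvation described in comment #2 can be reproduced in isolation. The C sketch below is a simplified model, not sshd code. One pipe stands in for the dead client connection: its write end stays open, so the read end is never readable. A second pipe stands in for a forwarded channel socket that always has room to write.

Because the write set is always ready, select() returns immediately every time and never reports a timeout. Any check that runs only on timeout is therefore never reached. In the real scenario, the channel becomes writable intermittently as the poller connects, rather than continuously. The function name count_select_timeouts is hypothetical.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical model: returns how many of `iterations` select() calls ended
 * by timeout (ret == 0), or -1 if the pipes could not be created. */
static int count_select_timeouts(int iterations, long timeout_sec)
{
    int client[2], channel[2];
    int timeouts = 0;

    if (pipe(client) == -1)
        return -1;
    if (pipe(channel) == -1) {
        close(client[0]);
        close(client[1]);
        return -1;
    }

    for (int i = 0; i < iterations; i++) {
        fd_set readset, writeset;
        struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
        int maxfd = client[0] > channel[1] ? client[0] : channel[1];

        FD_ZERO(&readset);
        FD_ZERO(&writeset);
        FD_SET(client[0], &readset);    /* "client": nothing ever arrives */
        FD_SET(channel[1], &writeset);  /* "channel": empty pipe, always writable */

        if (select(maxfd + 1, &readset, &writeset, NULL, &tv) == 0)
            timeouts++;                 /* only here would a timeout-driven check run */
    }

    close(client[0]);
    close(client[1]);
    close(channel[0]);
    close(channel[1]);
    return timeouts;
}
```

Calling count_select_timeouts(100, 10) should return 0 almost instantly, even though each call asks for a 10 second timeout.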
bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-09 08:56 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

--- Comment #3 from Darren Tucker <dtucker at zip.com.au> ---
Created attachment 3029
  --> https://bugzilla.mindrot.org/attachment.cgi?id=3029&action=edit
keep track of the last time we heard from the client and trigger client_alive_check()

(In reply to willchan from comment #2)
> What did you think about my questions around this code snippet which
> I linked earlier from wait_until_can_do_something()?

I think you're right; the select won't time out, so client_alive_check() won't be triggered. Attached is an untested patch which might help...
bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-09 16:02 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

--- Comment #4 from willchan at google.com ---
Cool, thanks! I glanced briefly at the patch and it looks like it'll definitely help.

One minor nit: it could also update the select timeout. It's OK as is, but it means that client_alive_check() may, in the worst case, be called up to just under a full ClientAliveInterval later than it should. The worst case is when writesetp becomes ready at the moment when

    last_client_time + options.client_alive_interval == monotime()

so it fails the

    last_client_time + options.client_alive_interval < monotime()

check. The next select() call then times out only after a full ClientAliveInterval.

That's a nit. This fixes the bulk of the issue AFAICT. Thanks.
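The nit in comment #4 amounts to deriving the select() timeout from the remaining time until the next liveness check is due, rather than always waiting a full ClientAliveInterval. A minimal C sketch of that idea follows. Both helpers are hypothetical; only last_client_time and the client alive interval come from the discussion above.

The sketch also uses a non-strict comparison, so that the exact boundary (now equal to the deadline) already counts as due. This is one possible design and not necessarily what was committed.

```c
#include <time.h>

/* Hypothetical helper: how many seconds select() should wait so that it
 * wakes no later than the moment the client-alive check is due.
 * All times are seconds on a monotonic clock. 0 means "due now". */
static time_t client_alive_timeout(time_t last_client_time,
                                   time_t client_alive_interval, time_t now)
{
    time_t deadline = last_client_time + client_alive_interval;
    return deadline > now ? deadline - now : 0;
}

/* Hypothetical helper: non-zero when the client-alive check should run.
 * Using >= (not <) means a wakeup exactly at the deadline is not skipped. */
static int client_alive_due(time_t last_client_time,
                            time_t client_alive_interval, time_t now)
{
    return now >= last_client_time + client_alive_interval;
}
```

Consider a wakeup at exactly the deadline. With last_client_time = 100 and an interval of 10, client_alive_due(100, 10, 110) is true, so the check runs immediately. If a wakeup happens at 107 instead, client_alive_timeout(100, 10, 107) returns 3. The next select() therefore waits only the remaining 3 seconds rather than another full interval.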
bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-10 06:27 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

Darren Tucker <dtucker at zip.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #3029|0                           |1
        is obsolete|                            |
                 CC|                            |djm at mindrot.org
 Attachment #3030|                            |ok?(djm at mindrot.org)
              Flags|                            |

--- Comment #5 from Darren Tucker <dtucker at zip.com.au> ---
Created attachment 3030
  --> https://bugzilla.mindrot.org/attachment.cgi?id=3030&action=edit
keep track of the last time we heard from the client and trigger client_alive_check()

I came up with the following to reproduce:

1) Make sure you've got an inetd with the discard service enabled.
2) sshd -o ClientAliveInterval=3 -o ClientAliveCountMax=3 -p 2022
3) ssh -p 2022 -R 1234:localhost:9 localhost
4) while sleep 1; do echo foo; done | nc localhost 1234
5) pkill -STOP -u $USER -x ssh

-current does indeed hang.

I found that my first patch kills the connection too early. Once the last_client_time check fires, it'll fire again immediately, so last_client_time needs to be reset when that happens. With that change it works more or less as expected.

I'm not super concerned about the potential timing inaccuracy you mention, as we're looking at redoing the select code to use something that allows a bit more flexibility and is easier to reason about.
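Comment #5 explains why last_client_time must be reset after the check fires. The fake-clock model below, written in C, makes this concrete using the parameters from the reproduction steps: ClientAliveInterval=3, ClientAliveCountMax=3, and a poller that wakes select() every second. It assumes the client never responds after t = 0, and that the connection is dropped once the number of unanswered checks exceeds ClientAliveCountMax. That exact comparison is an assumption of this sketch.

The function simulate_disconnect is hypothetical and is not the committed patch.

```c
#include <stdbool.h>

/* Hypothetical fake-clock model. Each simulated second, select() returns
 * because the forwarded channel has traffic, never because of a timeout.
 * Returns the second at which the connection is dropped, or -1 if it
 * survives max_seconds. */
static int simulate_disconnect(int interval, int count_max,
                               bool reset_after_check, int max_seconds)
{
    int last_client_time = 0; /* last time we heard from the client */
    int unanswered = 0;       /* liveness checks with no reply */

    for (int now = 1; now <= max_seconds; now++) {
        if (now - last_client_time >= interval) {
            /* stands in for client_alive_check() */
            if (++unanswered > count_max)
                return now;
            /* Without this reset, the condition above stays true and the
             * check fires on every subsequent wakeup. */
            if (reset_after_check)
                last_client_time = now;
        }
    }
    return -1;
}
```

Without the reset, simulate_disconnect(3, 3, false, 60) drops the connection at second 6, because the check fires on every wakeup after the first interval. With the reset, simulate_disconnect(3, 3, true, 60) spaces the checks one interval apart and drops the connection at second 12.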
bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-10 06:27 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

Darren Tucker <dtucker at zip.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |2698

Referenced Bugs:

https://bugzilla.mindrot.org/show_bug.cgi?id=2698
[Bug 2698] Tracking bug for OpenSSH 7.6 release
bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-11 04:31 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

Darren Tucker <dtucker at zip.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #6 from Darren Tucker <dtucker at zip.com.au> ---
I have committed a variant of this patch and it will be in the 7.6 release. Thanks for the report and analysis!
bugzilla-daemon at bugzilla.mindrot.org
2017-Aug-11 04:42 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

--- Comment #7 from willchan at google.com ---
Thanks for the quick turnaround! Much appreciated.
bugzilla-daemon at bugzilla.mindrot.org
2018-Apr-06 02:26 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

Damien Miller <djm at mindrot.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED

--- Comment #8 from Damien Miller <djm at mindrot.org> ---
Close all resolved bugs after release of OpenSSH 7.7.
bugzilla-daemon at mindrot.org
2023-Jan-13 02:19 UTC
[Bug 2756] sshd does not seem to terminate despite ClientAlive[Interval|CountMax] when a process is polling a remote forwarding channel
https://bugzilla.mindrot.org/show_bug.cgi?id=2756

Damien Miller <djm at mindrot.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #3030|ok?(djm at mindrot.org)      |
              Flags|                            |