Matt Day
2007-Mar-14 18:53 UTC
sshd gets stuck: select() in packet_read_seqnr waits indefinitely
Dear OpenSSH Portable sshd developers, I'm having a problem where sshd login sessions are occasionally (as often as once a day) getting stuck indefinitely. I enabled debug messages and got a backtrace of a stuck sshd, and I think I've found the bug. I wanted to run it by the list once before filing. sshd version: OpenSSH_4.2p1 FreeBSD-20050903, OpenSSL 0.9.7e-p1 25 Oct 2004 Uncommented lines (ie. nondefault settings) in sshd_config: LogLevel DEBUG ClientAliveInterval 90 Subsystem sftp /usr/libexec/sftp-server SSH client: PuTTY version 0.58, default settings OS/HW: FreeBSD 6.1-RELEASE running on 64-bit x86 ("amd64" platform) Executive summary: The select() in packet_read_seqnr() waits indefinitely, resulting in stuck SSH sessions when networking problems interfere with key exchange. Would like to be able to set a timeout there, or send SSH keepalives during key exchange. Periodically (every 60 minutes) the SSH client initiates rekeying via key exchange. Here's an example of a successful rekeying: Mar 11 19:02:35 SSH2_MSG_KEXINIT received Mar 11 19:02:35 SSH2_MSG_KEXINIT sent Mar 11 19:02:35 kex: client->server aes256-ctr hmac-sha1 none Mar 11 19:02:35 kex: server->client aes256-ctr hmac-sha1 none Mar 11 19:02:35 SSH2_MSG_KEX_DH_GEX_REQUEST_OLD received Mar 11 19:02:35 SSH2_MSG_KEX_DH_GEX_GROUP sent Mar 11 19:02:35 expecting SSH2_MSG_KEX_DH_GEX_INIT Mar 11 19:02:38 SSH2_MSG_KEX_DH_GEX_REPLY sent Mar 11 19:02:38 set_newkeys: rekeying Mar 11 19:02:38 SSH2_MSG_NEWKEYS sent Mar 11 19:02:38 expecting SSH2_MSG_NEWKEYS Mar 11 19:02:38 set_newkeys: rekeying Mar 11 19:02:38 SSH2_MSG_NEWKEYS received In the failure case, sshd gets stuck during key exchange. The SSH session had been going fine for many hours, and then these were the last messages it logged: Mar 11 20:02:38 SSH2_MSG_KEXINIT received Mar 11 20:02:38 SSH2_MSG_KEXINIT sent Mar 11 20:02:38 kex: client->server aes256-ctr hmac-sha1 none Mar 11 20:02:38 kex: server->client aes256-ctr hmac-sha1 none Mar 11 20:02:38 SSH2_MSG_KEX_DH_GEX_REQUEST_OLD received Mar 11 20:02:38 SSH2_MSG_KEX_DH_GEX_GROUP sent Mar 11 20:02:38 expecting SSH2_MSG_KEX_DH_GEX_INIT The user was idle when this happened, but had a program running that was generating output. That program became tty-blocked after about 30 minutes, presumably because sshd wasn't draining its output, and that's when I noticed the user's sshd was stuck and got a backtrace: (gdb) where #0 0x.. in select () from /lib/libc.so.6 #1 0x.. in packet_read_seqnr () from /usr/lib/libssh.so.3 #2 0x.. in packet_read () from /usr/lib/libssh.so.3 #3 0x.. in packet_read_expect () from /usr/lib/libssh.so.3 #4 0x.. in kexgex_server (kex=0x538900) at kexgexs.c:99 #5 0x.. in kex_setup () from /usr/lib/libssh.so.3 #6 0x.. in kex_input_kexinit () from /usr/lib/libssh.so.3 #7 0x.. in dispatch_run () from /usr/lib/libssh.so.3 #8 0x.. in process_buffered_input_packets () at serverloop.c:475 #9 0x.. in server_loop2 (authctxt=0x4) at serverloop.c:760 #10 0x.. in do_authenticated2 (authctxt=0x4) at session.c:2456 #11 0x.. in do_authenticated (authctxt=0x53a400) at session.c:227 #12 0x.. in main at sshd.c:1749 This backtrace agrees with the debug messages: it's in kexgex_server(), calling packet_read_expect(SSH2_MSG_KEX_DH_GEX_INIT), which ultimately calls select() from packet_read_seqnr(). The select call in packet_read_seqnr passes NULL for a timeout, meaning it will wait forever. That explains why the comment above it says "Note that no other data is processed until this returns, so this function should not be used during the interactive session." But, this was an interactive session. I've set ClientAliveInterval in sshd_config so that SSH sessions die in a timely manner when networking problems arise, but the keepalive is apparently not sent during key exchange. The default TCP keepalive on FreeBSD is unhelpful here; it only kicks in after 2 hours, and I need stuck SSH sessions to die a lot sooner. I want to keep the FreeBSD TCP keepalive defaults. Would it be possible for the select() in packet_read_seqnr to use an optional timeout? Similarly, I believe the select() in packet_write_wait has the same problem. Upon timeout, it would be fine with me if the session died with an error logged. Alternatively, if SSH keepalives were sent during key exchange, that would suffice. Thanks for reading, and thanks for making OpenSSH. I've been a happy user for many years now! Matt Day <opensshbugs at fjarlq.com>
Darren Tucker
2007-Mar-15 00:14 UTC
sshd gets stuck: select() in packet_read_seqnr waits indefinitely
On Wed, Mar 14, 2007 at 12:53:09PM -0600, Matt Day wrote:> Dear OpenSSH Portable sshd developers, > > I'm having a problem where sshd login sessions are occasionally > (as often as once a day) getting stuck indefinitely. I enabled debug > messages and got a backtrace of a stuck sshd, and I think I've found > the bug. I wanted to run it by the list once before filing.[...]> The select() in packet_read_seqnr() waits indefinitely, resulting > in stuck SSH sessions when networking problems interfere with > key exchange. Would like to be able to set a timeout there, or > send SSH keepalives during key exchange.[...]> The select call in packet_read_seqnr passes NULL for a timeout, > meaning it will wait forever. That explains why the comment above > it says "Note that no other data is processed until this returns, > so this function should not be used during the interactive session." > But, this was an interactive session. > > I've set ClientAliveInterval in sshd_config so that SSH sessions > die in a timely manner when networking problems arise, but the > keepalive is apparently not sent during key exchange. The default > TCP keepalive on FreeBSD is unhelpful here; it only kicks in after > 2 hours, and I need stuck SSH sessions to die a lot sooner. I want > to keep the FreeBSD TCP keepalive defaults. > > Would it be possible for the select() in packet_read_seqnr to use > an optional timeout? Similarly, I believe the select() in > packet_write_wait has the same problem. Upon timeout, it would be > fine with me if the session died with an error logged.You could try the attached patch. The timeout should perhaps be ClientAliveInterval * ClientAliveCountMax, (though you'd need to be wary of integer overflow).> Alternatively, > if SSH keepalives were sent during key exchange, that would suffice.You're not allowed to send most types of packets (this would include keepalives) once you've initiated a key exchange until it completes. Index: packet.c ==================================================================RCS file: /usr/local/src/security/openssh/cvs/openssh/packet.c,v retrieving revision 1.146 diff -u -p -r1.146 packet.c --- packet.c 17 Jan 2007 00:00:14 -0000 1.146 +++ packet.c 14 Mar 2007 23:48:23 -0000 @@ -136,6 +136,10 @@ static int server_side = 0; /* Set to true if we are authenticated. */ static int after_authentication = 0; +/* Set to the maximum time that we will wait for (or to send) a packet */ +static struct timeval packet_wait_tv; +static struct timeval *packet_wait_tvp = NULL; + /* Session key information for Encryption and MAC */ Newkeys *newkeys[MODE_MAX]; static struct packet_state { @@ -166,7 +170,7 @@ TAILQ_HEAD(, packet) outgoing; * packet_set_encryption_key is called. */ void -packet_set_connection(int fd_in, int fd_out) +packet_set_connection(int fd_in, int fd_out, int timeout) { Cipher *none = cipher_by_name("none"); @@ -187,6 +191,15 @@ packet_set_connection(int fd_in, int fd_ buffer_init(&incoming_packet); TAILQ_INIT(&outgoing); } + + if (timeout == 0) + packet_wait_tvp = NULL; + else { + packet_wait_tv.tv_sec = timeout; + packet_wait_tv.tv_usec = 0; + packet_wait_tvp = &packet_wait_tv; + } + } /* Returns 1 if remote host is connected via socket, 0 if not. */ @@ -923,7 +936,8 @@ packet_read_seqnr(u_int32_t *seqnr_p) FD_SET(connection_in, setp); /* Wait for some data to arrive. */ - while (select(connection_in + 1, setp, NULL, NULL, NULL) == -1 && + while (select(connection_in + 1, setp, NULL, NULL, + packet_wait_tvp) == -1 && (errno == EAGAIN || errno == EINTR)) ; @@ -1449,7 +1463,8 @@ packet_write_wait(void) memset(setp, 0, howmany(connection_out + 1, NFDBITS) * sizeof(fd_mask)); FD_SET(connection_out, setp); - while (select(connection_out + 1, NULL, setp, NULL, NULL) == -1 && + while (select(connection_out + 1, NULL, setp, NULL, + packet_wait_tvp) == -1 && (errno == EAGAIN || errno == EINTR)) ; packet_write_poll(); Index: packet.h ==================================================================RCS file: /usr/local/src/security/openssh/cvs/openssh/packet.h,v retrieving revision 1.47 diff -u -p -r1.47 packet.h --- packet.h 26 Mar 2006 03:30:02 -0000 1.47 +++ packet.h 14 Mar 2007 23:48:38 -0000 @@ -20,7 +20,7 @@ #include <openssl/bn.h> -void packet_set_connection(int, int); +void packet_set_connection(int, int, int); void packet_set_nonblocking(void); int packet_get_connection_in(void); int packet_get_connection_out(void); Index: ssh-keyscan.c ==================================================================RCS file: /usr/local/src/security/openssh/cvs/openssh/ssh-keyscan.c,v retrieving revision 1.91 diff -u -p -r1.91 ssh-keyscan.c --- ssh-keyscan.c 23 Oct 2006 17:01:16 -0000 1.91 +++ ssh-keyscan.c 14 Mar 2007 23:50:23 -0000 @@ -358,7 +358,7 @@ keygrab_ssh2(con *c) { int j; - packet_set_connection(c->c_fd, c->c_fd); + packet_set_connection(c->c_fd, c->c_fd, timeout); enable_compat20(); myproposal[PROPOSAL_SERVER_HOST_KEY_ALGS] = c->c_keytype == KT_DSA? "ssh-dss": "ssh-rsa"; Index: sshconnect.c ==================================================================RCS file: /usr/local/src/security/openssh/cvs/openssh/sshconnect.c,v retrieving revision 1.171 diff -u -p -r1.171 sshconnect.c --- sshconnect.c 23 Oct 2006 17:02:24 -0000 1.171 +++ sshconnect.c 14 Mar 2007 23:46:27 -0000 @@ -157,7 +157,7 @@ ssh_proxy_connect(const char *host, u_sh xfree(command_string); /* Set the connection file descriptors. */ - packet_set_connection(pout[0], pin[1]); + packet_set_connection(pout[0], pin[1], options.server_alive_interval); /* Indicate OK return */ return 0; @@ -385,7 +385,7 @@ ssh_connect(const char *host, struct soc error("setsockopt SO_KEEPALIVE: %.100s", strerror(errno)); /* Set the connection. */ - packet_set_connection(sock, sock); + packet_set_connection(sock, sock, options.server_alive_interval); return 0; } Index: sshd.c ==================================================================RCS file: /usr/local/src/security/openssh/cvs/openssh/sshd.c,v retrieving revision 1.362 diff -u -p -r1.362 sshd.c --- sshd.c 25 Feb 2007 09:37:22 -0000 1.362 +++ sshd.c 14 Mar 2007 23:45:09 -0000 @@ -1707,7 +1707,7 @@ main(int ac, char **av) * Register our connection. This turns encryption off because we do * not have a key. */ - packet_set_connection(sock_in, sock_out); + packet_set_connection(sock_in, sock_out, options.client_alive_interval); packet_set_server(); /* Set SO_KEEPALIVE if requested. */ -- Darren Tucker (dtucker at zip.com.au) GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4 37C9 C982 80C7 8FF4 FA69 Good judgement comes with experience. Unfortunately, the experience usually comes from bad judgement.