Can you test against 2.9p2 or the current snapshots? There have been some
SIGCHLD changes since the 2.5.2pX series.
- Ben
On Tue, 18 Sep 2001, Paul Menage wrote:
>
> We use ssh (RedHat 2.5.2p2-5) heavily in non-interactive mode, for
> managing servers from central controllers, and transferring applications/
> data around networks.
>
> Very occasionally we've seen the situation where the ssh client and
> server are both stuck in select, both selecting on only the tcp socket
> of the connection, and with no timeout. No children of sshd remain (even
> as zombies), and it has no other interesting open fds.
>
> If you send a SIGCHLD to the hung sshd, it wakes up and exits.
>
> As far as I can see, there's a race condition in
> wait_until_can_do_something(), both in RedHat 2.5.2p2-5 and in the
> latest CVS sources. It tests child_terminated, and sets a non-zero
> timeout if so, before calling select(). However, there is a very small
> window (between checking child_terminated and calling select()) in which
> a SIGCHLD can arrive and set child_terminated. If this happens, and
> there is no other activity from the client or the child fds, sshd can
> hang indefinitely in the select().
>
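The check-then-select pattern described above can be sketched roughly as
follows. This is a simplified illustration, not the actual sshd source:
racy_wait(), sigchld_handler() and the 100 ms timeout are hypothetical names
and values chosen for the example; only child_terminated and the overall
shape come from the description.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Stand-in for sshd's flag; set asynchronously by the SIGCHLD handler. */
static volatile sig_atomic_t child_terminated = 0;

/* Hypothetical handler: records that a child has exited. */
static void sigchld_handler(int sig)
{
    (void)sig;
    child_terminated = 1;
}

/* Hypothetical helper: waits for activity on sock_fd, returns select()'s result. */
static int racy_wait(int sock_fd)
{
    fd_set readset;
    struct timeval tv;
    struct timeval *tvp = NULL;     /* NULL timeout: block indefinitely */

    FD_ZERO(&readset);
    FD_SET(sock_fd, &readset);

    if (child_terminated) {         /* step 1: test the flag */
        tv.tv_sec = 0;
        tv.tv_usec = 100000;        /* illustrative non-zero timeout */
        tvp = &tv;
    }

    /*
     * Race window: a SIGCHLD delivered here sets child_terminated after
     * the test above, so tvp is still NULL and nothing wakes select().
     */
    return select(sock_fd + 1, &readset, NULL, NULL, tvp);  /* step 2 */
}
```

If the flag is already set on entry, the call returns after the short
timeout; the hang needs the signal to land between step 1 and step 2.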
> Catching this bug in the wild is not easy, as the window for the race
> condition is so small. However, it can be fairly easily reproduced under
> the following slightly artificial conditions, by using gdb to pause sshd
> within the window for long enough to kill the child:
>
> Run ssh -T -x localhost 'sleep 30s; echo X; exec >&- 2>&- <&- sleep 5h'
>
> Find the sshd serving this connection, and connect to it with gdb. It
> will be in the middle of the actual select() system call.
>
> Set a breakpoint at the start of libc select() and continue. When the
> first sleep completes, the shell will print X, and sshd will hit the
> breakpoint in select().
>
> Continue once, and the X will get printed out at the client end of the
> connection, and sshd will hit the breakpoint again.
>
> By this time, the child shell has closed its fds and exec'd itself as
> "sleep 5h". Since the breakpoint is at the start of libc select(), sshd
> has checked child_terminated, but not yet invoked the select() system
> call.
>
> Send a SIGKILL to the child "sleep" process. A SIGCHLD becomes pending
> for sshd.
>
> Quit gdb, detaching from the process. At this point, sshd receives the
> SIGCHLD, sets child_terminated, and enters the select() system call. In
> the absence of any external events, the client/server are now
> deadlocked, even though the child has exited.
>
> In this situation, the child is clearly visible as a zombie; we have
> also seen hangs where there is no zombie child, but I've not been able
> to reproduce that case.
>
> I think that the correct way to fix this would probably be to use
> something like SIGIO and sigtimedwait() rather than select(), but that
> would be a substantial change. A simple fix for this problem would be to
> set a maximum timeout on the select() call of e.g. 15s. Are there any
> complications or bugs that could be introduced by such a change?
>
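A minimal sketch of the bounded-timeout idea, assuming a wrapper loop around
select(). wait_bounded(), on_sigchld() and the return-code convention are
hypothetical and not taken from the sshd source. The point is only that
select() never gets a NULL timeout, so a SIGCHLD that slips through the
window is noticed after at most max_wait_sec seconds.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Stand-in for sshd's flag; set asynchronously by the SIGCHLD handler. */
static volatile sig_atomic_t child_terminated = 0;

/* Hypothetical handler: records that a child has exited. */
static void on_sigchld(int sig)
{
    (void)sig;
    child_terminated = 1;
}

/*
 * Hypothetical helper. Returns 1 when sock_fd is readable, 0 when the
 * child is known to have terminated, and -1 on a select() error.
 */
static int wait_bounded(int sock_fd, long max_wait_sec)
{
    for (;;) {
        fd_set readset;
        struct timeval tv;
        int n;

        if (child_terminated)
            return 0;

        FD_ZERO(&readset);
        FD_SET(sock_fd, &readset);
        tv.tv_sec = max_wait_sec;   /* e.g. 15, as suggested above */
        tv.tv_usec = 0;

        /* Always bounded: a signal missed above costs at most one timeout. */
        n = select(sock_fd + 1, &readset, NULL, NULL, &tv);
        if (n > 0)
            return 1;
        if (n < 0 && errno != EINTR)
            return -1;
        /* Timeout or EINTR: loop and re-check child_terminated. */
    }
}
```

This narrows the failure to a delay rather than removing the race. Blocking
SIGCHLD and waiting with pselect() would be another option, where available.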
> Paul
>
>