thr3ads.net - openssh unix dev - AW: sshd hangs [Jan 2005]

Martin.Dudle at mgb.ch
2005-Jan-24 15:22 UTC
AW: sshd hangs

hello

applied the patch described below - unfortunately we still experience
rare hangs of the remote sshd. not surprising, as the patch only changes
a few lines in server_loop() - but not in server_loop2(), which is used
for non-interactive sessions.

process id of hanging sshd: 26110

process is sleeping forever in poll (why does server_loop2() sleep
forever?):
root at XXX:~# truss -fp 26110
26110:  poll(0xFFBEF268, 2, -1)         (sleeping...)

no child processes are around:
root at XXX:~# ps -ef | grep 26110
    root 26110 24012  0 14:50:11 ?        0:00 /usr/local/sbin/sshd
    root  8136  7433  0 15:15:34 pts/5    0:00 truss -fp 26110
    root  8217  7433  0 15:15:55 pts/5    0:00 grep 26110

sending it a SIGCLD by hand to see whether ECHILD is handled properly
(it is not :-/):
root at XXX:~# kill -CLD 26110
26110:      Received signal #18, SIGCLD, in poll() [caught]
26110:  poll(0xFFBEF268, 2, -1)                         Err#4 EINTR
26110:  sigaction(SIGCLD, 0x00000000, 0xFFBEEDD0)       = 0
26110:  write(6, "\0", 1)                               = 1
26110:  setcontext(0xFFBEEF50)
26110:  sigprocmask(SIG_BLOCK, 0xFFBEF328, 0xFFBEF338)  = 0
26110:  waitid(P_ALL, 0, 0xFFBEF240, WEXITED|WTRAPPED|WNOHANG) Err#10 ECHILD
26110:  sigprocmask(SIG_SETMASK, 0xFFBEF338, 0x00000000) = 0
26110:  poll(0xFFBEF268, 2, -1)                         = 1
26110:  read(4, "\0", 1)                                = 1
26110:  read(4, 0xFFBEF2CF, 1)                          Err#11 EAGAIN
26110:  sigprocmask(SIG_BLOCK, 0xFFBEF328, 0xFFBEF338)  = 0
26110:  sigprocmask(SIG_SETMASK, 0xFFBEF338, 0x00000000) = 0
26110:  poll(0xFFBEF268, 2, -1)         (sleeping...)

stack:
> $c
libc.so.1`_poll+4(b, 0, 0, ffbef278, 68dc8, ffbef268)
0x1f278(ffbef3c4, ffbef3c0, ffbef3bc, ffbef3b8, 0, 1)
server_loop2+0xe0(6e518, 0, 0, ff078000, 2151c, 1)
do_authenticated+0x80(6e518, 6e518, 6e518, ffbef4c4, 2151c, 66000)
main+0xc28(2e, 68d88, 64000, 1, 1ed0, 66674)
_start+0x5c(0, 0, 0, 0, 0, 0)

disassemble trace:
server_loop2+0xe0:              call      -0x102c       <0x1f118>

0x1f0f0:                        sethi     %hi(0x46c00), %o0
...
0x1f24c:                        add       %fp, -0x18, %o4
0x1f250:                        sll       %o0, 5, %g1
0x1f254:                        sub       %g1, %o0, %g1
0x1f258:                        sll       %g1, 2, %g1
0x1f25c:                        add       %g1, %o0, %g1
0x1f260:                        sll       %g1, 3, %g1
0x1f264:                        st        %g1, [%fp - 0x14]
0x1f268:                        ld        [%i2], %o0
0x1f26c:                        ld        [%i0], %o1
0x1f270:                        ld        [%i1], %o2
0x1f274:                        add       %o0, 1, %o0
0x1f278:                        call      +0x439b8      <PLT=libc.so.1`select>
0x1f27c:                        clr       %o3

c code (patched):
static void
collect_children(void)
{
        pid_t pid;
        sigset_t oset, nset;
        int status;

        /* block SIGCHLD while we check for dead children */
        sigemptyset(&nset);
        sigaddset(&nset, SIGCHLD);
        sigprocmask(SIG_BLOCK, &nset, &oset);
        if (child_terminated) {
                while ((pid = waitpid(-1, &status, WNOHANG)) > 0 ||
                    (pid < 0 && errno == EINTR))
                        if (pid > 0)
                                session_close_by_pid(pid, status);
                child_terminated = 0;
        }
        sigprocmask(SIG_SETMASK, &oset, NULL);
}


while code could be added to work around the hang (give select() in
server_loop2() a finite timeout, have collect_children() detect and
handle ECHILD properly), i think that the child process should not die
or terminate undetected by the parent in the first place.

will try to find why this happens and let you know if i find something.

regards,
-martin



Martin Dudle wrote:
> using openssh-3.8.1p1 from sunfreeware.com on a SunOS XXX 5.8
> Generic_117000-03 sun4u sparc SUNW,Sun-Fire-V240.
> 
> sshd seems to ignore or miss SIGCLD. this is a rare behaviour we observe
> about once per week in a ssh intensive environment.
Try the patch attached to this bug:
bugzilla.mindrot.org/show_bug.cgi?id=967

-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
     Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.