hello,
i applied the patch described below. unfortunately we still experience
rare hangs of the remote sshd. this is not surprising, as the patch only
changes a few lines in server_loop() but not in server_loop2(), which is
used for non-interactive sessions.
process id of hanging sshd: 26110
process is sleeping forever in poll (why does server_loop2() sleep
forever?):
root@XXX:~# truss -fp 26110
26110: poll(0xFFBEF268, 2, -1) (sleeping...)
no child processes are around:
root@XXX:~# ps -ef | grep 26110
root 26110 24012 0 14:50:11 ? 0:00 /usr/local/sbin/sshd
root 8136 7433 0 15:15:34 pts/5 0:00 truss -fp 26110
root 8217 7433 0 15:15:55 pts/5 0:00 grep 26110
sending it a SIGCLD to see whether ECHILD would be handled properly (it
is not :-/):
root@XXX:~# kill -CLD 26110
26110: Received signal #18, SIGCLD, in poll() [caught]
26110: poll(0xFFBEF268, 2, -1) Err#4 EINTR
26110: sigaction(SIGCLD, 0x00000000, 0xFFBEEDD0) = 0
26110: write(6, "\0", 1) = 1
26110: setcontext(0xFFBEEF50)
26110: sigprocmask(SIG_BLOCK, 0xFFBEF328, 0xFFBEF338) = 0
26110: waitid(P_ALL, 0, 0xFFBEF240, WEXITED|WTRAPPED|WNOHANG) Err#10 ECHILD
26110: sigprocmask(SIG_SETMASK, 0xFFBEF338, 0x00000000) = 0
26110: poll(0xFFBEF268, 2, -1) = 1
26110: read(4, "\0", 1) = 1
26110: read(4, 0xFFBEF2CF, 1) Err#11 EAGAIN
26110: sigprocmask(SIG_BLOCK, 0xFFBEF328, 0xFFBEF338) = 0
26110: sigprocmask(SIG_SETMASK, 0xFFBEF338, 0x00000000) = 0
26110: poll(0xFFBEF268, 2, -1) (sleeping...)
stack:
> $c
libc.so.1`_poll+4(b, 0, 0, ffbef278, 68dc8, ffbef268)
0x1f278(ffbef3c4, ffbef3c0, ffbef3bc, ffbef3b8, 0, 1)
server_loop2+0xe0(6e518, 0, 0, ff078000, 2151c, 1)
do_authenticated+0x80(6e518, 6e518, 6e518, ffbef4c4, 2151c, 66000)
main+0xc28(2e, 68d88, 64000, 1, 1ed0, 66674)
_start+0x5c(0, 0, 0, 0, 0, 0)
disassemble trace:
server_loop2+0xe0: call -0x102c <0x1f118>
0x1f0f0: sethi %hi(0x46c00), %o0
...
0x1f24c: add %fp, -0x18, %o4
0x1f250: sll %o0, 5, %g1
0x1f254: sub %g1, %o0, %g1
0x1f258: sll %g1, 2, %g1
0x1f25c: add %g1, %o0, %g1
0x1f260: sll %g1, 3, %g1
0x1f264: st %g1, [%fp - 0x14]
0x1f268: ld [%i2], %o0
0x1f26c: ld [%i0], %o1
0x1f270: ld [%i1], %o2
0x1f274: add %o0, 1, %o0
0x1f278: call +0x439b8 <PLT=libc.so.1`select>
0x1f27c: clr %o3
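a side note on the instructions at 0x1f250..0x1f260: the
sll/sub/sll/add/sll sequence multiplies %o0 by 1000 without a multiply
instruction (32x, 31x, 124x, 125x, 1000x). the result is stored at
%fp - 0x14, one word past the %fp - 0x18 address placed in %o4, which
would fit filling in the microseconds field of a struct timeval from a
millisecond value. that reading is an assumption, since the code in
between is elided. the same arithmetic in C, with a made-up function
name:

```c
#include <stdint.h>

/*
 * Mirrors the five instructions at 0x1f250..0x1f260 one by one.
 * mul1000_by_shifts() is a hypothetical name used only for illustration;
 * it does not exist in sshd.
 */
static uint32_t
mul1000_by_shifts(uint32_t x)
{
	uint32_t g1;

	g1 = x << 5;	/* sll %o0, 5, %g1   ->   32 * x */
	g1 = g1 - x;	/* sub %g1, %o0, %g1 ->   31 * x */
	g1 = g1 << 2;	/* sll %g1, 2, %g1   ->  124 * x */
	g1 = g1 + x;	/* add %g1, %o0, %g1 ->  125 * x */
	g1 = g1 << 3;	/* sll %g1, 3, %g1   -> 1000 * x */
	return g1;
}
```

for inputs small enough not to overflow 32 bits this returns the same
value as x * 1000.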
c code (patched):
static void
collect_children(void)
{
	pid_t pid;
	sigset_t oset, nset;
	int status;

	/* block SIGCHLD while we check for dead children */
	sigemptyset(&nset);
	sigaddset(&nset, SIGCHLD);
	sigprocmask(SIG_BLOCK, &nset, &oset);
	if (child_terminated) {
		while ((pid = waitpid(-1, &status, WNOHANG)) > 0 ||
		    (pid < 0 && errno == EINTR))
			if (pid > 0)
				session_close_by_pid(pid, status);
		child_terminated = 0;
	}
	sigprocmask(SIG_SETMASK, &oset, NULL);
}
while there could be code to work around the hang (have select() in
server_loop2() not wait forever, or have collect_children() detect and
handle ECHILD properly), i think that the child process should not die
or terminate undetected by the parent in the first place.
i will try to find out why this happens and let you know if i find
something.
regards,
-martin
Martin Dudle wrote:
> using openssh-3.8.1p1 from sunfreeware.com on a SunOS XXX 5.8
> Generic_117000-03 sun4u sparc SUNW,Sun-Fire-V240.
>
> sshd seems to ignore or miss SIGCLD. this is a rare behaviour we
> observe about once per week in a ssh intensive environment.
Try the patch attached to this bug:
http://bugzilla.mindrot.org/show_bug.cgi?id=967
--
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4 37C9 C982 80C7 8FF4 FA69
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.