using openssh-3.8.1p1 from sunfreeware.com on a SunOS XXX 5.8
Generic_117000-03 sun4u sparc SUNW,Sun-Fire-V240.
sshd seems to ignore or miss SIGCLD. this is rare behaviour: we observe it
about once per week in an ssh-intensive environment.
the process hangs here:
truss:
24453: poll(0xFFBEEF28, 2, -1) (sleeping...)
gcore, mdb:
libc.so.1`_poll+4(b, 0, 0, ffbeef38, 6fc40, ffbeef28)
0x20710(ffbef084, ffbef080, ffbef07c, ffbef078, 0, 1)
server_loop2+0xd4(6a800, 0, 0, ff1e8000, 2151c, 1)
do_authenticated+0x80(753b0, 6a400, f90, 1, 2151c, 6d800)
main+0xbf4(2f, 6fc00, 6a800, 1ecc, 1, 6dbd0)
_start+0x5c(0, 0, 0, 0, 0, 0)
the corresponding C source is:
void
server_loop2(Authctxt *authctxt)
{
        [ ... ]
        for (;;) {
                process_buffered_input_packets();
                rekeying = (xxx_kex != NULL && !xxx_kex->done);
                if (!rekeying && packet_not_very_much_data_to_write())
                        channel_output_poll();
                wait_until_can_do_something(&readset, &writeset,
                    &max_fd,
                    &nalloc, 0);
                [ ... ]
and it hangs in the select() call in wait_until_can_do_something()
(truss shows it as poll(), presumably because select() is built on poll()
in libc).
question: why is the wait time set to 0 (= wait forever)? server_loop()
(the interactive function) does not set it to 0.
if the child exits without the parent noticing it, then we hang forever,
which is bad.
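a minimal sketch of what a bounded wait could look like, so that the loop
wakes up periodically and can re-check for exited children. wait_readable()
and its millisecond argument are hypothetical names for illustration only,
not the sshd code. it assumes plain POSIX select() and keeps the convention
that 0 means "wait forever":

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not from sshd): wait until fd is readable or
 * timeout_ms milliseconds have elapsed.  timeout_ms == 0 means block
 * indefinitely.  Returns 1 if readable, 0 on timeout, and -1 on error
 * (including EINTR, which the caller can treat as "check for children").
 */
int
wait_readable(int fd, unsigned int timeout_ms)
{
        fd_set rset;
        struct timeval tv, *tvp = NULL;

        FD_ZERO(&rset);
        FD_SET(fd, &rset);

        if (timeout_ms != 0) {
                tv.tv_sec = timeout_ms / 1000;
                tv.tv_usec = (timeout_ms % 1000) * 1000;
                tvp = &tv;      /* finite timeout: select() will return */
        }
        return select(fd + 1, &rset, NULL, NULL, tvp);
}
```

with a non-zero timeout the caller gets control back even if the signal was
lost, at the cost of periodic wakeups.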
i tried to send the process a SIGCLD by hand to see if it would 'unlock'
itself. here's the result:
# kill -CLD 24453
truss:
24453: Received signal #18, SIGCLD, in poll() [caught]
24453: poll(0xFFBEEF28, 2, -1) Err#4 EINTR
24453: sigaction(SIGCLD, 0x00000000, 0xFFBEEA90) = 0
24453: write(6, "\0", 1) = 1
24453: setcontext(0xFFBEEC10)
24453: sigprocmask(SIG_BLOCK, 0xFFBEEFE8, 0xFFBEEFF8) = 0
24453: waitid(P_ALL, 0, 0xFFBEEF00, WEXITED|WTRAPPED|WNOHANG) Err#10 ECHILD
24453: sigprocmask(SIG_SETMASK, 0xFFBEEFF8, 0x00000000) = 0
24453: poll(0xFFBEEF28, 2, -1) = 1
24453: read(4, "\0", 1) = 1
24453: read(4, 0xFFBEEF8F, 1) Err#11 EAGAIN
24453: sigprocmask(SIG_BLOCK, 0xFFBEEFE8, 0xFFBEEFF8) = 0
24453: sigprocmask(SIG_SETMASK, 0xFFBEEFF8, 0x00000000) = 0
24453: poll(0xFFBEEF28, 2, -1) (sleeping...)
it seems there is another problem here with collect_children() not
handling ECHILD:
{
        pid_t pid;
        sigset_t oset, nset;
        int status;

        /* block SIGCHLD while we check for dead children */
        sigemptyset(&nset);
        sigaddset(&nset, SIGCHLD);
        sigprocmask(SIG_BLOCK, &nset, &oset);
        if (child_terminated) {
                while ((pid = waitpid(-1, &status, WNOHANG)) > 0 ||
                    (pid < 0 && errno == EINTR))
                        if (pid > 0)
                                session_close_by_pid(pid, status);
                child_terminated = 0;
        }
        sigprocmask(SIG_SETMASK, &oset, NULL);
}
waitpid() returns -1 with errno == ECHILD. child_terminated is reset to 0
(why?) and that's it.
the program returns to the endless loop (for (;;)) in server_loop2() and
sleeps forever again.
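to illustrate the ECHILD case, here is a minimal sketch of a non-blocking
reap loop that reports ECHILD to its caller instead of silently dropping it.
reap_children() and the no_children flag are hypothetical and not part of
sshd. what the caller should do on ECHILD (for example, close the session)
is left open here:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not from sshd): reap every child that has already
 * exited, without blocking.  Returns the number of children reaped and
 * sets *no_children to 1 if waitpid() reported ECHILD, i.e. there is
 * nothing left to wait for.
 */
int
reap_children(int *no_children)
{
        pid_t pid;
        int status, reaped = 0;

        *no_children = 0;
        for (;;) {
                pid = waitpid(-1, &status, WNOHANG);
                if (pid > 0) {
                        reaped++;       /* one exited child collected */
                        continue;
                }
                if (pid == 0)
                        break;          /* children exist, none has exited */
                if (errno == EINTR)
                        continue;       /* interrupted, try again */
                if (errno == ECHILD)
                        *no_children = 1;
                break;
        }
        return reaped;
}
```

a caller that sees no_children set while it still has an open session knows
the exit status was lost and can clean up rather than sleep again.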
could anyone shed some light on these thoughts? thanks.
regards,
-martin dudle