Hi All,
I am not sure if this is the same thing as the hang on exit bug, so sorry if
this is a duplication of previous stuff.
Essentially, I am experiencing ssh hangs with about 0.5% - 1% of my
connections. I am running 2.9p2 on Solaris 7. I actually have empirical
data on the hangings, as I wrote a script to create these connections
in an endless loop, setting an alarm so I could recover from a hang.
I will place this script at the end of my email.
I am using RSA authentication with no passwords, going over an ethernet
network.
Here is a dump from running truss and pstack on the remote and local
ssh sessions:
Local:
truss:
poll(0xFFBEEFC0, 2, -1) (sleeping...)
pstack:
10879: ssh epapdev at mate ls /etc/hosts
ff217cfc poll (ffbeefc0, 2, ffffffff)
ff1cf6b0 select (ffbeefd0, ff238bc4, 14b480, ff238bc8, 14b484, a) + 298
0004cc44 client_wait_until_can_do_something (ffbef200, ffbef1fc, ffbef1e4, 0, 9, 10000) + 3c4
0004e8a4 client_loop (0, ffffffff, 0, 14afb8, ff235ad4, 85308) + 6d4
00040c94 ssh_session2 (14afb8, 2, ffbef684, 141684, 144da8, 144da8) + 11c
0003f41c main (4, ffbef50c, ffbef520, 131c00, 0, 0) + 1cd4
0003cfbc _start (0, 0, 0, 0, 0, 0) + dc
Remote:
truss:
poll(0xFFBEF558, 2, -1) (sleeping...)
pstack:
15390: /opt/TKLCplat/sbin/sshd
ff217cfc poll (ffbef558, 2, ffffffff)
ff1cf6b0 select (ffbef568, ff238bc4, 153230, ff238bc8, 153234, c) + 298
00052128 wait_until_can_do_something (ffbef6dc, ffbef6d8, ffbef6d0, 0, 0, 0) + 500
0005387c server_loop2 (0, 0, 0, 0, 0, 0) + 19c
0005ab60 do_authenticated2 (153ea0, 0, 0, 0, ff235ad4, 54bd0) + 8
00054c40 do_authenticated (153ea0, 153ea0, 153ea0, 2000, ffff, 0) + b0
0004435c do_authentication2 (1187a0, 7, c30b, ffbefd64, ff235ad4, 41888) + d4
00041914 main (1, ffbefdec, ffbefdf4, 138c00, 0, 0) + 267c
0003dedc _start (0, 0, 0, 0, 0, 0) + dc
truss only yields one call because I am attaching it to the process after
the fact. The one thing I can see with my limited experience is that
both the remote and local processes are blocked in poll with a timeout
of -1, i.e. no timeout. Since each side is waiting indefinitely for the
other, I suppose they are in a deadlock.
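
To illustrate what "no timeout" means here, this is a minimal sketch in
Perl, not taken from the ssh sources. It uses a pipe that is never written
to as a stand-in for the idle connection, and the core IO::Select module
in place of a raw poll call. With a timeout the wait returns empty-handed;
with no timeout (the equivalent of the -1 seen in the traces) the same call
would never return.

```perl
use strict;
use warnings;
use IO::Select;

# A pipe that is never written to: the read end never becomes ready.
# This is only a stand-in for a connection on which no data arrives.
pipe(my $reader, my $writer) or die "pipe failed: $!";

my $sel = IO::Select->new($reader);

# can_read() with no argument would block indefinitely, like poll(..., -1).
# With a 1-second timeout it gives up and returns an empty list.
my @ready = $sel->can_read(1);

print @ready ? "data ready\n" : "timed out, nothing to read\n";
```

If both ends of a connection wait this way without a timeout, and neither
has anything left to send, nothing will ever wake either of them up.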
I have been loosely following the hang on exit thread and gathered that
this might have something to do with ttys, so I tried using the -T switch.
Over a six-hour period, this yields the same ratio of hangs to successes
as not using the switch.
Is there any workaround available for this? Also, do you need any more
information from me? If needed, I could change my program to run truss
on every attempted session and save the results of the hung sessions.
Cheers...james
P.S. you must change the $host and $login variables to an RSA authenticated
machine of your choosing in the script below.
<<<Test Program follows>>>
#
use strict;
use warnings;

my $count = 0;
my $test = "~~~ ring ~~~ ring ~~~~\n"; # message raised by the alarm handler
my $sshhung = 0;
my $success = 0;
my $evalerr = 0;
my $pid;
my $childpid;
my $rc;
my $login = ""; # Place your login here
my $host = ""; # Place your host here
while (1)
{
    $count++;
    print <<EOF;
Test #${count}
======================== Hangs: ${sshhung}
Eval Errors: ${evalerr}
Success: ${success}
EOF
    eval {
        local $SIG{ALRM} = sub { die $test };
        $pid = fork();
        # fork() returns undef on failure
        die "Could not fork!" unless defined $pid;
        if ($pid == 0) # I am the child
        {
            # The @ must be escaped, otherwise Perl tries to interpolate
            # an array here and the host part is lost.
            exec('ssh', "${login}\@${host}", 'ls', '/etc/hosts');
            # Exit rather than die: a die here would be caught by the
            # eval and the child would carry on running the loop.
            print STDERR "EXEC FAILED!!!\n";
            exit(127);
        } # End of child
        #
        # Ok, back in the parental role...
        $childpid = '';
        alarm(10);
        $childpid = wait();
        $rc = $? >> 8;
        alarm(0);
    };
    # If we timed out, the ssh session hung.
    # Any other error is counted as an eval error.
    if ($@)
    {
        # Any sort of error in here means that the child
        # may still be alive...time to die...
        if (defined $pid && $pid > 0)
        {
            kill(9, $pid);
            waitpid($pid, 0); # reap the child so zombies do not pile up
        }
        #
        # Was this the alarm or some other error?
        if ($@ ne $test) { $evalerr++; }
        else { $sshhung++; }
    }
    else { $success++; }
}
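
The offer above to save diagnostics from hung sessions could be sketched
roughly as follows. This is only an illustration: capture_stack is a
hypothetical helper, the output directory and file naming are assumptions,
and it records pstack output only. truss attached to a live process keeps
running until interrupted, so it would need its own timeout and is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: save a stack dump of a (presumably hung) process.
# $tool defaults to the Solaris 'pstack' command; it is a parameter only
# so that a different command can be substituted on other systems.
sub capture_stack {
    my ($pid, $dir, $tool) = @_;
    $tool = 'pstack' unless defined $tool;
    die "bad pid" unless defined $pid && $pid =~ /^\d+$/;

    # One file per hang, named after the pid and the current time.
    my $file = "$dir/hang-$pid-" . time() . ".txt";

    # Collect stdout and stderr of the tool in one string.
    my $out = `$tool $pid 2>&1`;
    $out = '' unless defined $out;

    open(my $fh, '>', $file) or die "Cannot write $file: $!";
    print {$fh} $out;
    close($fh);
    return $file;
}
```

In the test program this would be called in the hang case only, i.e. when
$@ eq $test, and before kill(9, $pid), for example
capture_stack($pid, '/tmp'). The remote sshd would still have to be
inspected separately on the remote host.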