We have an application, running under ssh-agent, which fires off a large
number of ssh processes, all of which try to talk to the agent through its
UNIX domain socket under /tmp. When the agent is slow to respond and the
listen queue fills up, connect() calls start to fail with ECONNREFUSED, and
ssh exits (we use agent authentication exclusively). Increasing the listen
backlog in ssh-agent.c mitigates the problem to some extent, but it only
masks it: the client should retry a number of times, possibly forever, when
connect() fails with a temporary error and is likely to succeed later.

With SSH-1.2.27's ssh this happens in authfd.c, line 372; if the connect()
fails (because of ECONNREFUSED), ssh silently gives up trying to talk to the
agent:

  sock = socket(AF_UNIX, SOCK_STREAM, 0);
  if (sock < 0)
    {
      error("Socket failed");
      if (newauthsockdir != NULL)
        {
          unlink(authsocket);
          chdir("/");
          rmdir(newauthsockdir);
          xfree(newauthsockdir);
        }
      xfree(authsocketdir);
      return -1;
    }
  if (connect(sock, (struct sockaddr *)&sunaddr,
              AF_UNIX_SIZE(sunaddr)) < 0)
    {
      close(sock);
      if (newauthsockdir != NULL)
        {
          unlink(authsocket);
          chdir("/");
          rmdir(newauthsockdir);
          xfree(newauthsockdir);
        }
      xfree(authsocketdir);
      return -1;
    }

We fixed SSH-1.2.27 by wrapping this part of the code in a while-loop (looping
if errno == ECONNREFUSED), and this appears to work well, solving our
immediate problem.
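
For illustration, here is a minimal, standalone sketch of that kind of retry
loop. It is not our actual patch and not code from SSH-1.2.27 or OpenSSH: the
function name connect_agent_retry() and its max_attempts and delay_ms
parameters are made up for this example, and the bounded attempt count and
fixed delay are just one possible policy. It opens a new socket for each
attempt, since POSIX leaves the state of a socket unspecified after a failed
connect().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of SSH-1.2.27 or OpenSSH: connect to the
 * UNIX domain socket at `path`, retrying only when connect() fails with
 * ECONNREFUSED. Returns a connected descriptor, or -1 with errno set.
 */
int
connect_agent_retry(const char *path, int max_attempts, long delay_ms)
{
    struct sockaddr_un sunaddr;
    struct timespec delay;
    int attempt, sock, saved_errno;

    if (strlen(path) >= sizeof(sunaddr.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    memset(&sunaddr, 0, sizeof(sunaddr));
    sunaddr.sun_family = AF_UNIX;
    strcpy(sunaddr.sun_path, path);    /* length checked above */

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (attempt = 1; ; attempt++) {
        /* Fresh socket per attempt: do not reuse one after a failure. */
        sock = socket(AF_UNIX, SOCK_STREAM, 0);
        if (sock < 0)
            return -1;
        if (connect(sock, (struct sockaddr *)&sunaddr, sizeof(sunaddr)) == 0)
            return sock;

        /* close() may change errno, so keep the connect() error. */
        saved_errno = errno;
        close(sock);
        errno = saved_errno;

        /* Give up on any other error, or once the attempts are used up. */
        if (saved_errno != ECONNREFUSED || attempt >= max_attempts)
            return -1;
        nanosleep(&delay, NULL);
    }
}
```

A caller would use something like connect_agent_retry(authsocket, 10, 100)
and fall back to the existing cleanup path only when it returns -1. Note that
ECONNREFUSED is also what connect() returns when nothing is listening on the
socket at all, so a bounded number of attempts may be safer than retrying
forever.
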
In OpenSSH, it looks like ssh_get_authentication_socket() in authfd.c could
easily be made to act in a similar fashion. It would be great if OpenSSH
would handle this situation more gracefully as well.
Thanks,
--
Jos Backus                 _/  _/_/_/        "Modularity is not a hack."
                          _/  _/   _/                -- D. J. Bernstein
                         _/  _/_/_/
                    _/  _/  _/    _/
josb at cncdsl.com  _/_/   _/_/_/            use Std::Disclaimer;