Bob Belnap
2009-Apr-17 16:04 UTC
Issues with ssh-agent connecting to a large number of hosts at once
Hi,

I'm having problems with ssh-agent when I am connecting to a large number of hosts (several hundred) at once. I'm using kanif (http://taktuk.gforge.inria.fr/kanif/), a very nice package that distributes ssh connections across the hosts you are connecting to (a fan-out sort of approach, so the connections do not all come from one host). However, every host has to authenticate, so every host has to wind its way back to the ssh-agent.

This problem isn't isolated to kanif. I see it when using other utilities that rely on many concurrent connections to the ssh-agent.

Running strace on the ssh-agent, things start out OK, then go sour and it starts spitting out:

    read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)
    read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)
    read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)

while pegging the CPU.

Tracking the number of connections to the agent once every second with

    while true; do netstat -x | grep -c <agent socket name>; sleep 1; done

looks like:

    5 5 5 35 98 154 155 200 287 287

At that point I kill the agent, but it will stick at that value if I don't. It's not always 287; it varies. I've seen it as high as 447 connections at once, but it's usually in the 200 range.

I've tried different ssh-agents on different kernels and machines, and haven't found a combination that works. However, most FreeBSD machines I've tried did not seem to have the problem. Also, using Pageant on Windows does not have any issues (*gasp*).

It seems to me that I'm hitting some kind of kernel limit (open file limit perhaps?). But I've fiddled with every sysctl value I can find, and haven't found the right magic. Has anyone run into this, or can anyone offer further debugging suggestions?

(btw, ssh -V shows: OpenSSH_5.1p1 Debian-3ubuntu1, OpenSSL 0.9.8g)

Thanks.

--Bob
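For anyone who wants to reproduce the connection count without netstat, here is a minimal Python sketch of the same idea. It assumes a Linux host, where the Unix-domain socket table is exposed as text in /proc/net/unix (Path being the eighth whitespace-separated column). The function name, the socket path /tmp/ssh-example/agent.1234, and the sample table are made up for illustration, and paths containing spaces are not handled.

```python
def count_unix_socket_entries(proc_net_unix_text, socket_path):
    """Count rows of a /proc/net/unix-style table whose Path equals socket_path."""
    count = 0
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: Num RefCount Protocol Flags Type St Inode Path
        # Unnamed sockets have no Path column, so they have only 7 fields.
        if len(fields) >= 8 and fields[7] == socket_path:
            count += 1
    return count


# Made-up sample in the /proc/net/unix layout (hypothetical socket path).
SAMPLE = """\
Num       RefCount Protocol Flags    Type St Inode Path
0000000000000000: 00000002 00000000 00010000 0001 01 20001 /tmp/ssh-example/agent.1234
0000000000000000: 00000003 00000000 00000000 0001 03 20002 /tmp/ssh-example/agent.1234
0000000000000000: 00000003 00000000 00000000 0001 03 20003 /tmp/ssh-example/agent.1234
0000000000000000: 00000003 00000000 00000000 0001 03 20004
0000000000000000: 00000002 00000000 00010000 0001 01 20005 /run/other.sock
"""

agent_entries = count_unix_socket_entries(SAMPLE, "/tmp/ssh-example/agent.1234")
print(agent_entries)
```

On a real machine the text would come from open("/proc/net/unix").read() and the path from the SSH_AUTH_SOCK environment variable. Like the grep -c above, the count includes the listening socket itself.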
Bob Proulx
2009-Apr-18 01:48 UTC
Issues with ssh-agent connecting to a large number of hosts at once
Bob Belnap wrote:
> It seems to me that I'm hitting some kind of kernel limit (open file limit
> perhaps?) But I've fiddled with every sysctl value I can find, and haven't
> found the right magic. Anyone run into this or can offer further debugging
> suggestions? (btw, ssh-v shows: OpenSSH_5.1p1 Debian-3ubuntu1, OpenSSL
> 0.9.8g)

I don't have a perfect understanding of this, but not seeing anyone else say anything, I will jump in and make some suggestions, imperfect though they will be.

Different types of kernels handle this differently, which would account for why different systems behave differently. But most have a limited amount of memory available for network resources. Quickly opening and closing network connections can cause memory to be consumed at a high rate. Once the available memory is exceeded, system calls fail for being out of resources until more resources are available. This is what you are seeing.

Why do resources become consumed? Look at RFC 793 and you will find the TCP state diagram. Look particularly at the TIME_WAIT state. You are probably creating many connections that hang around in the TIME_WAIT state after they are closed, until the timeout expires. Each of those consumes network memory. You can see these connections by looking at the state reported by netstat (e.g. 'netstat | grep TIME_WAIT'). If you see many connections in the TIME_WAIT state, then this is what you are running into. In many kernels with a limited amount of network resources, this limits the rate at which connections may be created and closed.

I am not familiar with TakTuk, but it appears to try to avoid this problem by spreading the load around. That is good. But perhaps you are still exceeding the system limits. It appears to me that you are.

This isn't really particular to ssh but is generic to anything that creates TCP connections. Since ssh uses TCP, it has the same limitation as any other program that uses TCP and leaves connections in the TIME_WAIT state until they time out and their resources are reclaimed.

Hope that helps.

Bob
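The TIME_WAIT check suggested above can also be sketched in Python. This assumes Linux, where /proc/net/tcp lists IPv4 TCP sockets with the state as a hex code in the fourth column and 06 means TIME_WAIT. The function name and the sample rows are made up for illustration. Note that this only covers the TCP side: the agent socket counted with netstat -x in the original post is a Unix-domain socket and will not show up here.

```python
TCP_TIME_WAIT = "06"  # hex state code Linux prints for TIME_WAIT in /proc/net/tcp


def count_time_wait(proc_net_tcp_text):
    """Count rows of a /proc/net/tcp-style table that are in the TIME_WAIT state."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: sl local_address rem_address st ...
        if len(fields) > 3 and fields[3] == TCP_TIME_WAIT:
            count += 1
    return count


# Made-up sample in the /proc/net/tcp layout: one LISTEN (0A),
# two TIME_WAIT (06), one ESTABLISHED (01).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1111
   1: 0100007F:C350 0100007F:0016 06 00000000:00000000 03:00000AF0 00000000     0        0 0
   2: 0100007F:C351 0100007F:0016 06 00000000:00000000 03:00000AF0 00000000     0        0 0
   3: 0100007F:C352 0100007F:0016 01 00000000:00000000 00:00000000 00000000  1000        0 2222
"""

time_wait_count = count_time_wait(SAMPLE)
print(time_wait_count)
```

On a live system the input would be open("/proc/net/tcp").read() (and /proc/net/tcp6 for IPv6 sockets).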
Kevin Steves
2009-Apr-22 22:58 UTC
Issues with ssh-agent connecting to a large number of hosts at once
On Fri, Apr 17, 2009 at 10:04:34AM -0600, Bob Belnap wrote:
: read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily
: unavailable)

looks like select() tells us a non-blocking fd is ready for reading but there is nothing to read and we loop forever on EAGAIN.

is it an ssh(1) that is connecting to the agent?

there is an ssh-agent -d option, you could add some debug() to troubleshoot.
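To make the EAGAIN behaviour concrete, here is a small Python sketch of a non-blocking read on a Unix-domain socket with nothing queued. It is not ssh-agent's code: try_read is a hypothetical helper, and socket.socketpair() merely stands in for an agent connection. It only shows that such a read fails with EAGAIN (surfaced in Python as BlockingIOError), so a caller that retries immediately instead of waiting for readiness would spin the way the strace output suggests.

```python
import socket


def try_read(sock, size=1024):
    """Return the bytes read, or None if the non-blocking read hit EAGAIN."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        # EAGAIN / EWOULDBLOCK: nothing to read right now. The caller should
        # wait for readiness again rather than retry in a tight loop.
        return None


# A connected pair of Unix-domain stream sockets stands in for the
# client <-> agent connection.
a, b = socket.socketpair()
a.setblocking(False)

first = try_read(a)   # nothing has been sent yet
b.sendall(b"ping")
second = try_read(a)  # data is now queued on the socket

a.close()
b.close()
print(first, second)
```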