Almost like clockwork, every 3 days, one of my servers starts to generate errors like the one below. It isn't continuous at first, but it gradually grows worse. It just started happening again today, after 3 days, 2 hours of uptime:

    Mar 20 07:59:26 mars sshd[717]: error: reexec socketpair: No buffer space available

As unrelated as this might sound, out of three servers that are virtually identical, this is the only one using gmirror for its drives rather than a hardware RAID controller. Two of the three are running kernels built at about the same time:

    # ssh jupiter uname -a
    FreeBSD jupiter.hub.org 6.2-STABLE FreeBSD 6.2-STABLE #1: Fri Mar 16 13:13:02 ADT 2007 root@jupiter.hub.org:/usr/obj/usr/src/sys/kernel i386

vs.

    # ssh mars uname -a
    FreeBSD mars.hub.org 6.2-STABLE FreeBSD 6.2-STABLE #5: Tue Mar 13 02:29:37 ADT 2007 root@mars.hub.org:/usr/obj/usr/src/sys/kernel i386

Right now, jupiter is running more than mars. So either I have something misconfigured on mars that is done right on jupiter, or there is a bug being tickled on mars that isn't being tickled on jupiter.

If I have a login session on the machine, I can easily reboot it, and it seems to come up clean every time (i.e., no fscks need to be run).

Does anyone have any ideas of what I can look at? I've checked nmbclusters on the two machines, and both are at 25600, but I'm not sure which sysctl to look at to see how much of that 25600 is actually in use.

--
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org                              MSN . scrappy@hub.org
Yahoo . yscrappy               Skype: hub.org                ICQ . 7615664
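To put a number on how much of the 25600 is actually in use, FreeBSD's `netstat -m` summary (the replies suggest `netstat -mb`) reports mbuf clusters as current/cache/total/max, plus a line of denied requests. The sketch below is a minimal parser for those two lines; the regular expressions assume the FreeBSD 6.x-era wording, and the function names are hypothetical rather than part of any FreeBSD tool.

```python
import re

# Assumed FreeBSD 6.x-era `netstat -m` wording, e.g.
#   "<current>/<cache>/<total>/<max> mbuf clusters in use (current/cache/total/max)"
#   "<mbufs>/<clusters>/<mbuf+clusters> requests for mbufs denied (...)"
CLUSTERS_RE = re.compile(
    r"^\s*(\d+)/(\d+)/(\d+)/(\d+)\s+mbuf clusters in use", re.MULTILINE
)
DENIED_RE = re.compile(
    r"^\s*(\d+)/(\d+)/(\d+)\s+requests for mbufs denied", re.MULTILINE
)


def cluster_usage(netstat_m_output):
    """Return the mbuf cluster counters as a dict, or None if the line is absent."""
    match = CLUSTERS_RE.search(netstat_m_output)
    if match is None:
        return None
    current, cache, total, maximum = (int(g) for g in match.groups())
    return {"current": current, "cache": cache, "total": total, "max": maximum}


def cluster_utilization(netstat_m_output):
    """Fraction of the cluster ceiling (nmbclusters) already allocated: total / max."""
    usage = cluster_usage(netstat_m_output)
    if usage is None or usage["max"] == 0:
        return None
    return usage["total"] / usage["max"]


def requests_denied(netstat_m_output):
    """Return the (mbufs, clusters, mbuf+clusters) denial counters, or None."""
    match = DENIED_RE.search(netstat_m_output)
    if match is None:
        return None
    return tuple(int(g) for g in match.groups())
```

On the server itself, the input would be the stdout of `subprocess.run(["netstat", "-m"], capture_output=True, text=True)`. Sampling it periodically over the three days would show whether cluster usage or the denied counters climb before the errors begin; nonzero denials would suggest mbuf exhaustion, while flat numbers would point elsewhere.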
Marc G. Fournier wrote:
> Mar 20 07:59:26 mars sshd[717]: error: reexec socketpair: No buffer space
> available
>
>
> If I have a login session on the machine, I can easily do a reboot of the
> machine, and it seems to come up clean every time (ie. no fsck's need to be
> run) ...
> Does anyone have any ideas of what I can look at?
>

How odd. The re-exec feature is not documented in the man page. According to sshd.c, it can be turned off with the -r switch. Can you give that a try and see whether it offers symptomatic relief? It would be somewhat less secure, as sshd will fork rather than fork and exec.

The code does indeed appear to use socketpair. FreeBSD implements socketpair as a system call, and only AF_UNIX, SOCK_STREAM sockets are accepted. A quick look in KScope suggests that the first place where this can fail with ENOBUFS is soalloc(), called from socreate().

Is this machine under heavy memory load in any way? soalloc() uses a zone allocator. I'm not sure how to track that from userland; vmstat -m only deals with kernel malloc() stats.

BMS
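Since the failing call is socketpair, one cheap way to tell whether the machine is in the bad state at a given moment, without waiting for sshd or sendmail to log an error, is to make the same kind of call directly. The sketch below is an illustration, not part of sshd: it uses Python's `socket.socketpair()` to request an AF_UNIX, SOCK_STREAM pair through the socketpair(2) system call and reports the errno if allocation fails. The function names are hypothetical.

```python
import errno
import os
import socket


def probe_socketpair():
    """Create and immediately close one AF_UNIX/SOCK_STREAM socket pair.

    Returns None on success, or the errno of the failure; ENOBUFS is the
    "No buffer space available" error seen in the sshd log.
    """
    try:
        left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    except OSError as exc:
        return exc.errno
    # We only care whether allocation succeeds, so release both ends at once.
    left.close()
    right.close()
    return None


def describe_errno(code):
    """Human-readable result of probe_socketpair()."""
    return "ok" if code is None else os.strerror(code)


if __name__ == "__main__":
    result = probe_socketpair()
    print(describe_errno(result))
    if result == errno.ENOBUFS:
        print("socket allocation is failing with ENOBUFS right now")
```

Run as a different user from the failing daemon, the probe may not hit the same per-user limits, so a clean result is weaker evidence than a failure.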
--On Monday, March 26, 2007 00:08:07 +0100 "Bruce M. Simpson" <bms@FreeBSD.org> wrote:

> Marc G. Fournier wrote:
>> Mar 20 07:59:26 mars sshd[717]: error: reexec socketpair: No buffer space
>> available
>>
>>
>> If I have a login session on the machine, I can easily do a reboot of the
>> machine, and it seems to come up clean every time (ie. no fsck's need to be
>> run) ...
>> Does anyone have any ideas of what I can look at?
>>
> How odd. The re-exec feature is not documented in the man page. It appears
> that it can be turned off with the -r switch according to sshd.c. Can you
> give that a try and see if that offers symptomatic relief? It would be
> somewhat less secure as sshd will fork rather than fork..exec.

That was actually just one example. I get more of these:

    sendmail[82066]: l2NEA1Ht082066: SYSERR(root): makeconnection: cannot create socket: No buffer space available

than I do the sshd errors. In another 15 hours or so, they will all start up again, like clockwork :(
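Because the errors start gradually and recur like clockwork across several daemons, it may help to record when each program starts failing and how quickly the rate climbs, so the onset can be lined up against uptime. The following is a minimal sketch that tallies "No buffer space available" lines per program and per hour; the regular expressions assume standard syslog formatting and plain word-character program names, and the helper names are hypothetical.

```python
import re
from collections import Counter

# "<program>[<pid>]: ... No buffer space available", as in the sshd and
# sendmail messages quoted in this thread.
ENOBUFS_RE = re.compile(r"(\w+)\[\d+\]:.*No buffer space available")
# Standard syslog timestamp prefix, e.g. "Mar 20 07:59:26 "; keeps "Mar 20 07".
HOUR_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}):\d{2}:\d{2} ")


def count_enobufs(lines):
    """Count ENOBUFS log lines per program name."""
    counts = Counter()
    for line in lines:
        match = ENOBUFS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_enobufs_by_hour(lines):
    """Count ENOBUFS log lines per (hour, program); hour is None without a timestamp."""
    counts = Counter()
    for line in lines:
        match = ENOBUFS_RE.search(line)
        if not match:
            continue
        stamp = HOUR_RE.match(line)
        hour = stamp.group(1) if stamp else None
        counts[(hour, match.group(1))] += 1
    return counts
```

Passing an open log file, e.g. `count_enobufs_by_hour(open("/var/log/messages"))`, gives hourly counts that should make the "gradually grows worse" pattern visible as a ramp.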
On Fri, 23 Mar 2007, Marc G. Fournier wrote:

> I've checked nmbclusters between the two machines, and both are at 25600,
> but not sure what sysctl to look at for how much is actually used out of
> that 25600 ...

netstat -mb

nmbclusters directly affects the number of clusters available in the network stack; it also indirectly affects the scaling of other settings, such as resource limits on the number of sockets. vmstat -z is also generally useful.

There are a few paths to ENOBUFS in the socket allocation code. One path is if you are over-committed on socket buffer resources with respect to the resource limits of the user. Check the output of limits, in particular the socket buffer size limit.

Robert N M Watson
Computer Laboratory
University of Cambridge
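The per-user socket buffer limit mentioned above can also be read programmatically. This is a minimal sketch assuming Python's `resource` module, which exposes `RLIMIT_SBSIZE` only on FreeBSD; the helper names are hypothetical. It reports only the limits of the process that runs it, so to learn anything about a failing daemon it has to run as the same user and with the same limits.

```python
import resource


def sbsize_limit():
    """Return the (soft, hard) socket buffer size limit of this process.

    RLIMIT_SBSIZE only exists in Python's resource module on FreeBSD, so on
    other systems this returns None instead of raising.
    """
    which = getattr(resource, "RLIMIT_SBSIZE", None)
    if which is None:
        return None
    return resource.getrlimit(which)


def describe_limit(value):
    """Format one limit value for display."""
    return "infinity" if value == resource.RLIM_INFINITY else f"{value} bytes"


if __name__ == "__main__":
    limits = sbsize_limit()
    if limits is None:
        print("RLIMIT_SBSIZE is not available on this platform")
    else:
        soft, hard = limits
        print("sbsize soft:", describe_limit(soft), "hard:", describe_limit(hard))
```

A finite limit here, combined with many long-lived sockets owned by the same user, would be consistent with the over-commit path described above; an unlimited value would make that path less likely.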