Marc G. Fournier
2007-May-04 01:09 UTC
Socket leak (Was: Re: What triggers "No Buffer Space) Available"?
I'm trying to probe this as well as I can, but network stacks and sockets have
never been my strong suit ...

Robert had mentioned in one of his emails about a "Sockets can also exist
without any referencing process (if the application closes, but there is still
data draining on an open socket)."

Now, that makes sense to me, I can understand that ... but, how would that look
as far as netstat -nA shows? Or, would it? For example, I have:

mars# netstat -nA | grep c9655a20
c9655a20 stream 0 0 0 c95d63f0 0 0
c95d63f0 stream 0 0 0 c9655a20 0 0
mars# netstat -nA | grep c95d63f0
c9655a20 stream 0 0 0 c95d63f0 0 0
c95d63f0 stream 0 0 0 c9655a20 0 0

They are attached to each other, but there appears to be no 'referencing
process' ... it is now 10pm at night ... I saved a 'snapshot' of netstat -nA
output at 6:45pm, over 3 hours ago, and it has the same entries as above:

c9655a20 stream 0 0 0 c95d63f0 0 0
c95d63f0 stream 0 0 0 c9655a20 0 0

again, if I'm reading this right, there is no 'referencing process' ... first,
of course, am I reading this right?

second ... if I am reading this right, and, if I am understanding what Robert
was saying about 'draining' (alot of ifs, I know) ... isn't it odd for it to
take >3 hours to drain?

Again, if I'm reading / understanding things right, without the 'referencing
process', it won't show up in sockstat -u, which is why my netstat -nA numbers
keep growing, but sockstat -u numbers don't ... which also means that there is
no way to figure out what process / program is leaving 'dangling sockets'? :(

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org                              MSN . scrappy@hub.org
Yahoo . yscrappy               Skype: hub.org        ICQ . 7615664
Matthew Dillon
2007-May-04 01:26 UTC
Socket leak (Was: Re: What triggers "No Buffer Space) Available"?
:I'm trying to probe this as well as I can, but network stacks and sockets have
:never been my strong suit ...
:
:Robert had mentioned in one of his emails about a "Sockets can also exist
:without any referencing process (if the application closes, but there is still
:data draining on an open socket)."
:
:Now, that makes sense to me, I can understand that ... but, how would that look
:as far as netstat -nA shows? Or, would it? For example, I have:
:
:...

Netstat should show any sockets, whether they are attached to processes or
not. Usually you can match up the address from netstat -nA with the
addresses from sockets shown by fstat to figure out what processes the
sockets are attached to.

There are three situations that you have to watch out for:

(1) The socket was close()'d and is still draining. The socket will
    timeout and terminate within ~1-5 minutes. It will not be referenced
    to a descriptor or process.

(2) The socket descriptor itself has been sent over a unix domain socket
    from one process to another and is currently in transit. The file
    pointer representing the descriptor is what is actually in transit,
    and will not be referenced by any processes while it is in transit.

    There is a garbage collector that figures out unreferencable loops.
    I think it's called unp_gc or something like that.

(3) The socket is not closed, but is idle (like having a remote shell
    open and never typing in it). Service processes can get stuck
    waiting for data on such sockets. The socket WILL be referenced by
    some process.

    These are controlled by net.inet.tcp.keep* and
    net.inet.tcp.always_keepalive.

I almost universally turn on net.inet.tcp.always_keepalive to ensure that
dead idle connections get cleaned out. Note that keepalive only applies to
idle connections. A socket that has been closed and needs to drain (either
data or the FIN state) will timeout and clean up itself whether keepalive
is turned on or off.

netstat -nA will give you the status of all your sockets.
You can observe the state of any TCP sockets.

Unix domain sockets have no state and closure is governed simply by them
being dereferenced, just like a pipe. In this case there are really only
two situations:

(1) One end of the unix domain socket is still referenced by a process, or

(2) The socket has been sent over another unix domain socket and is
    'in transit'. The socket will remain intact until it is either no
    longer in transit (read out from the other unix domain socket), or
    the garbage collector determines that the socket the descriptor is
    transiting over is not externally referencable, and will destroy it
    and any in-transit sockets contained within.

Any sockets that don't fall into these categories are in trouble... either
a timer has failed somewhere or (if unix domain) the garbage collector has
failed to detect that it is in an unreferencable loop.

One thing you can do is drop into single user mode... kill all the
processes on the system, and see if the sockets are recovered. That will
give you a good idea as to whether it is a real leak or whether some
process is directly or indirectly (by not draining a unix domain socket on
which other sockets are being transferred) holding onto the socket.

-Matt
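The 'in transit' case Matt describes can be reproduced in a few lines. What follows is only a minimal sketch, not something from the thread: it assumes Python 3.9 or later on a Unix-like system (for socket.send_fds and socket.recv_fds), and the function and variable names are made up for illustration. One end of a socketpair is queued on a second unix domain socket and the local descriptor is closed, so for a moment the only reference to that socket is the message sitting in the carrier.

```python
import socket


def pass_descriptor_in_transit():
    """Send one end of a socketpair over another unix domain socket.

    Illustrative sketch only; names are hypothetical.
    """
    # Carrier: a connected pair of unix domain sockets.
    carrier_tx, carrier_rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Payload: a second pair; one end will travel across the carrier.
    payload_a, payload_b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # Queue payload_b's descriptor on the carrier (an SCM_RIGHTS message).
    socket.send_fds(carrier_tx, [b"fd"], [payload_b.fileno()])

    # Drop the local descriptor. Until the carrier is read, the queued
    # message holds the only remaining reference to that socket.
    payload_b.close()

    # Reading the carrier takes the descriptor out of transit again.
    data, fds, _flags, _addr = socket.recv_fds(carrier_rx, 16, 1)
    received = socket.socket(fileno=fds[0])

    # The received descriptor is still the peer of payload_a.
    payload_a.sendall(b"ping")
    echoed = received.recv(4)

    for s in (carrier_tx, carrier_rx, payload_a, received):
        s.close()
    return data, echoed
```

If a process never reads the carrier, the queued socket stays allocated without showing up under any process, which is the indirect "holding onto the socket" situation described above.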
Ian Smith
2007-May-04 06:37 UTC
Socket leak (Was: Re: What triggers "No Buffer Space) Available"?
On Thu, 3 May 2007, Marc G. Fournier wrote:

 > Robert had mentioned in one of his emails about a "Sockets can also exist
 > without any referencing process (if the application closes, but there is still
 > data draining on an open socket)."

[..]

 > Again, if I'm reading / understanding things right, without the 'referencing
 > process', it won't show up in sockstat -u, which is why my netstat -nA numbers
 > keep growing, but sockstat -u numbers don't ... which also means that there is
 > no way to figure out what process / program is leaving 'dangling sockets'? :(

Marc, I don't know if it may provide any more clues in this instance, but
lsof -U also shows unix domain sockets with pid, command and fd.

Cheers, Ian
Robert Watson
2007-May-04 11:05 UTC
Socket leak (Was: Re: What triggers "No Buffer Space) Available"?
On Thu, 3 May 2007, Marc G. Fournier wrote:

> I'm trying to probe this as well as I can, but network stacks and sockets
> have never been my strong suit ...
>
> Robert had mentioned in one of his emails about a "Sockets can also exist
> without any referencing process (if the application closes, but there is
> still data draining on an open socket)."
>
> Now, that makes sense to me, I can understand that ... but, how would that
> look as far as netstat -nA shows? Or, would it? For example, I have:
>
> mars# netstat -nA | grep c9655a20
> c9655a20 stream 0 0 0 c95d63f0 0 0
> c95d63f0 stream 0 0 0 c9655a20 0 0
> mars# netstat -nA | grep c95d63f0
> c9655a20 stream 0 0 0 c95d63f0 0 0
> c95d63f0 stream 0 0 0 c9655a20 0 0
>
> They are attached to each other, but there appears to be no 'referencing
> process' ... it is now 10pm at night ... I saved a 'snapshot' of netstat -nA
> output at 6:45pm, over 3 hours ago, and it has the same entries as above:
>
> c9655a20 stream 0 0 0 c95d63f0 0 0
> c95d63f0 stream 0 0 0 c9655a20 0 0
>
> again, if I'm reading this right, there is no 'referencing process' ...
> first, of course, am I reading this right?
>
> second ... if I am reading this right, and, if I am understanding what
> Robert was saying about 'draining' (alot of ifs, I know) ... isn't it odd
> for it to take >3 hours to drain?
>
> Again, if I'm reading / understanding things right, without the 'referencing
> process', it won't show up in sockstat -u, which is why my netstat -nA
> numbers keep growing, but sockstat -u numbers don't ... which also means
> that there is no way to figure out what process / program is leaving
> 'dangling sockets'? :(

I think we should be careful to avoid prematurely drawing conclusions about
the source of the problem.

First question: have you confirmed that the resource limit on sockets is
definitely what is causing the error you're seeing? I.e., does the number of
sockets hit the maximum sockets?
Second point: there are two kinds of resource leaks that seem likely
candidates for a socket resource exhaustion problem. First, kernel bugs, in
which the kernel maintains objects despite there being no application
references, and second, application reference leaks, in which applications
keep references to kernel objects despite no longer needing them.

Our immediate goal is to determine which of these is the case: is it a kernel
bug, or an application bug? Using tools like netstat and sockstat, we can try
and determine if all kernel sockets are properly referenced. Experience
suggests that it is an application bug, but we shouldn't rule out a kernel
bug; the good news is that the tools to use in the debugging process are
identical at this stage.

Robert N M Watson
Computer Laboratory
University of Cambridge
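Robert's first question, whether the socket count actually reaches the limit, can be checked numerically. The sketch below is an illustration rather than a tool anyone in the thread used. It assumes that `sysctl -n` is available, and that the running kernel exposes a current-count sysctl named kern.ipc.numopensockets alongside kern.ipc.maxsockets; that first name is an assumption and should be verified on the system in question. The helper names are hypothetical.

```python
import subprocess


def read_sysctl_int(name):
    """Return the integer value printed by `sysctl -n NAME`."""
    out = subprocess.run(
        ["sysctl", "-n", name], check=True, capture_output=True, text=True
    ).stdout
    return int(out.strip())


def socket_headroom(open_sockets, max_sockets):
    """Return (free_slots, fraction_used) for the given socket counts."""
    if max_sockets <= 0:
        raise ValueError("max_sockets must be positive")
    return max_sockets - open_sockets, open_sockets / max_sockets


def report():
    """Print how close the system is to its socket limit.

    kern.ipc.numopensockets is an assumed sysctl name; kern.ipc.maxsockets
    is the limit discussed in this thread.
    """
    open_sockets = read_sysctl_int("kern.ipc.numopensockets")
    max_sockets = read_sysctl_int("kern.ipc.maxsockets")
    free, used = socket_headroom(open_sockets, max_sockets)
    print(f"{open_sockets}/{max_sockets} sockets in use "
          f"({used:.1%}), {free} free")
```

Calling report() periodically and logging the result would show whether the "No buffer space available" errors coincide with the count approaching the limit, or whether some other resource is the one running out.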
Oliver Fromme
2007-May-07 17:01 UTC
Socket leak (Was: Re: What triggers "No Buffer Space) Available"?
Marc G. Fournier wrote:

 > Now, that makes sense to me, I can understand that ... but, how would
 > that look as far as netstat -nA shows? Or, would it? For example, I
 > have:

You should use "-na" to list all sockets, not "-nA".

 > mars# netstat -nA | grep c9655a20
 > c9655a20 stream 0 0 0 c95d63f0 0 0
 > c95d63f0 stream 0 0 0 c9655a20 0 0
 > mars# netstat -nA | grep c95d63f0
 > c9655a20 stream 0 0 0 c95d63f0 0 0
 > c95d63f0 stream 0 0 0 c9655a20 0 0
 >
 > They are attached to each other, but there appears to be no 'referencing
 > process'

netstat doesn't show processes at all (sockstat, fstat and lsof list
sockets by processes). The sockets above are probably from a socketpair(2)
or a pipe (which is implemented with socketpair(2), AFAIK). That's
perfectly normal.

If I remember correctly, you wrote that 11k sockets are in use with 90
jails. That's about 120 sockets per jail, which isn't out of the ordinary.
Of course it depends on what is running in those jails, but my guess is
that you just need to increase the limit on the number of sockets
(i.e. kern.ipc.maxsockets).

 > Again, if I'm reading / understanding things right, without the 'referencing
 > process', it won't show up in sockstat -u, which is why my netstat -nA numbers
 > keep growing, but sockstat -u numbers don't ... which also means that there is
 > no way to figure out what process / program is leaving 'dangling sockets'? :(

Be careful here, sockstat's output is process-based and lists sockets
multiple times. For example, the server sockets that httpd children
inherit from their parent are listed for every single child, while you see
it only once in the netstat output. On the other hand, sockstat doesn't
show sockets that have been closed and are in TIME_WAIT state or similar.

Are you sure that UNIX domain sockets are causing the problem? Can you
rule out other sockets (e.g. tcp)? In that case you should run
"netstat -funix" to list only UNIX domain sockets (basically the same as
the -u option to sockstat).
Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht München,
HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

$ dd if=/dev/urandom of=test.pl count=1
$ file test.pl
test.pl: perl script text executable
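The pairing visible in Marc's netstat excerpt, where each socket's connection column holds the other socket's address, can be picked out mechanically. The sketch below is an illustration under stated assumptions: it takes the column layout from the lines quoted in this thread (address first, type second, peer address sixth), and the set of process-referenced addresses has to be supplied by the caller from fstat, sockstat or lsof output, whose formats are not parsed here. All function names are hypothetical.

```python
def parse_unix_sockets(netstat_text):
    """Map each unix socket address to the address in its sixth column.

    Assumes lines shaped like the excerpt in this thread:
    ADDRESS TYPE RECVQ SENDQ INODE CONN REFS NEXTREF
    """
    peers = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, TCP/UDP lines and anything too short to hold CONN.
        if len(fields) < 6 or fields[1] not in ("stream", "dgram", "seqpacket"):
            continue
        peers[fields[0]] = fields[5]
    return peers


def mutual_pairs(peers):
    """Return sorted (a, b) tuples where a points at b and b points at a."""
    pairs = set()
    for addr, conn in peers.items():
        if conn != "0" and peers.get(conn) == addr:
            pairs.add(tuple(sorted((addr, conn))))
    return sorted(pairs)


def unreferenced_pairs(pairs, referenced):
    """Keep only pairs where neither end is in the referenced address set."""
    return [p for p in pairs if p[0] not in referenced and p[1] not in referenced]


# The two lines quoted in the thread.
sample = """\
c9655a20 stream 0 0 0 c95d63f0 0 0
c95d63f0 stream 0 0 0 c9655a20 0 0
"""
pairs = mutual_pairs(parse_unix_sockets(sample))
```

Running this over two snapshots taken hours apart and comparing the unreferenced pairs would separate ordinary socketpair(2) pairs that are held by a live process from pairs that persist with no process attached.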