Just had a problem with a box where it looks like it ran out of swap due to a problem process; that in itself is not the issue. The issue is that the kernel, on detecting this, seems to start killing off seemingly random processes, the first one being sshd, which made the machine inaccessible.

So the question is: does the kernel kill random processes when out of swap, or does it kill whichever processes request more memory when out of swap? Which leads to the question: would it not be more sensible to kill off the largest process first, as it's more than likely responsible for the problem?

[quote]
Apr 10 20:09:25 appledore kernel: pid 414 (sshd), uid 0, was killed: out of swap space
[/quote]

Steve
On Apr 11, 2005, at 12:01 PM, Steven Hartland wrote:

> of swap? Which leads to the question would it not be more sensible to
> kill off the largest process first as its more than likely that it is
> responsible
> for the problem?
>

so when this largest process is your production database server for your e-commerce site, what will you change your recommendation to be?

basically, there is no "right" choice of process to kill. a machine that is out of resources is just a bad situation, and the right thing is to try to avoid getting there with careful monitoring and planning.

Vivek Khera, Ph.D. +1-301-869-4449 x806
At 2005-04-12 13:52:59+0000, Vivek Khera writes:

> > of swap? Which leads to the question would it not be more sensible to
> > kill off the largest process first as its more than likely that it is
> > responsible
> > for the problem?
> >
>
> so when this largest process is your production database server for
> your e-commerce site, what will you change your recommendation to be?
>
> basically, there is no "right" choice of process to kill. a machine
> that is out of resources is just a bad situation, and the right thing
> is to try to avoid getting there with careful monitoring and planning.

The right choice is for mmap() to return ENOMEM, and then for malloc() to return NULL, but almost no operating systems make this choice any more.

Nick B
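As an illustration of the behaviour described above, here is a minimal C sketch of what a caller looks like when malloc() is allowed to fail cleanly. The helper name make_buffer is hypothetical and not from the thread; on a system that overcommits, the NULL branch may rarely be taken even when memory is effectively exhausted.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the thread): allocate a zero-filled
 * buffer of nbytes and report failure to the caller instead of
 * assuming the allocation succeeded.
 *
 * Returns 0 on success, ENOMEM if malloc() returned NULL.
 */
int make_buffer(size_t nbytes, unsigned char **out)
{
    /* malloc(0) may legally return NULL, so ask for at least one byte. */
    unsigned char *p = malloc(nbytes ? nbytes : 1);

    if (p == NULL) {
        /* The allocator refused the request: tell the caller. */
        *out = NULL;
        return ENOMEM;
    }

    /* Writing to the buffer is what forces real pages to be backed. */
    memset(p, 0, nbytes);
    *out = p;
    return 0;
}
```

A caller would check the return value, for example `if (make_buffer(n, &buf) != 0) { /* degrade or exit cleanly */ }`, and release the buffer with free() when done.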
At 2005-04-12 14:26:40+0000, Marc Olzheim writes:

> On Tue, Apr 12, 2005 at 03:06:41PM +0100, Nick Barnes wrote:
> > The right choice is for mmap() to return ENOMEM, and then for malloc()
> > to return NULL, but almost no operating systems make this choice any
> > more.
>
> No, the problem occurs only when previously allocated / mmap()d blocks
> are actually used (written) and when the total of virtual memory has
> been overcommitted: Physical pages are not allocated to processes at
> malloc() time, but at time of first usage (Copy On Write).

Yes, implicit in my statement is that the OS shouldn't overcommit.

I remember when overcommit was new (maybe 1990), and some Unix (Irix, perhaps, or AIX?) made it switchable. There was a bit of a flurry in the OS community, as some people (myself included) felt that the OS shouldn't make promises it couldn't fulfill, and that this "kill a random process" behaviour was more of a bug than a solution.

Consider a parallel design which allows (say) file descriptors to be overcommitted. You can open a billion files, but if you touch one of them, that consumes a finite kernel resource, and if the kernel has run out then a randomly chosen process gets killed. Great.

> many programs have been programmed in a way that assumes this
> behaviour, for instance by sparsely using large allocations instead
> of adding the possible extra bookkeeping to allow for smaller
> allocations.

This is the well-known problem with my fantasy world in which the OS doesn't overcommit any resources. All those programs are broken, but it's too costly to fix them. If overcommit had been resisted more effectively in the first place, those programs would have been written properly.

My recollection, quite possibly faulty, is that FreeBSD came quite late to the overcommit binge party.

Nick B
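For readers unfamiliar with the pattern Marc describes, the following C sketch shows a program that reserves a large anonymous mapping and then writes to only a few pages of it. The function name touch_sparsely and its parameters are hypothetical. Whether the untouched pages cost anything, and whether the mmap() call can fail up front, depends on the operating system's overcommit policy, which this sketch does not inspect.

```c
#define _DEFAULT_SOURCE /* assumption: exposes MAP_ANON on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): reserve len bytes of
 * anonymous memory, then write one byte in every stride-th page.
 * Only the pages written to need physical backing on a system that
 * allocates lazily; the rest of the reservation is a promise.
 *
 * Returns the number of pages written, or -1 on error.
 */
long touch_sparsely(size_t len, size_t stride)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || stride == 0 || len == 0)
        return -1;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1; /* the reservation itself was refused */

    long touched = 0;
    size_t step = (size_t)page * stride;
    for (size_t off = 0; off < len; off += step) {
        p[off] = 1; /* first write: this is where a page gets backed */
        touched++;
    }

    munmap(p, len);
    return touched;
}
```

Calling `touch_sparsely((size_t)1 << 30, 1024)` would reserve 1 GiB while writing to only a small fraction of its pages, which is the kind of program that relies on the reservation being cheap.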
At 2005-04-12 18:17:32+0000, Matthias Buelow writes:

> This stuff has been discussed in the past.

Indeed. For a couple of examples from the days before BSD systems got overcommit, see these threads from 1990 and 1991:

<http://groups-beta.google.com/group/comp.unix.aix/browse_frm/thread/91541dbf6b658465/4c590978f1001507?q=overcommit&rnum=14#4c590978f1001507>

<http://groups-beta.google.com/group/comp.unix.aix/browse_frm/thread/38c9bb9996d30eb1/e8c30f78c44a3f62?q=overcommit&rnum=12#e8c30f78c44a3f62>

Nick B