Hi all,

I'm having some trouble with some production 8.2-RELEASE servers where the 'Active' and 'Inact' memory values reported by top don't seem to correspond with the processes running on the machine. I have two near-identical machines (with slightly different workloads); on one, let's call it A, active + inactive is small (6.5G), and on the other (B) active + inactive is large (13.6G), even though they have almost identical sums-of-resident memory (8.3G on A and 9.3G on B).

The only difference is that A has a smaller number of quite long-running processes (it's hosting a small number of busy sites) and B has a larger number of more frequently killed/recycled processes (it's hosting a larger number of quiet sites, so the FastCGI processes get killed and restarted frequently). Notably, B has many more ZFS filesystems mounted than A (around 4,000 versus 100). The machines are otherwise under similar amounts of load. I hope the community can help me understand what's going on with the worryingly large amount of active + inactive memory on B.

Both machines are ZFS-on-root with FreeBSD 8.2-RELEASE and uptimes of around 5-6 days. I have recently reduced the ARC cache on both machines since my previous thread [1], and Wired memory usage is now stable at 6G on A and 7G on B with an arc_max of 4G on both machines.

Neither machine has any swap in use:

    Swap: 10G Total, 10G Free

My current (probably quite simplistic) understanding of the FreeBSD virtual memory system is that, for each process as reported by top:

    * Size corresponds to the total size of all the text pages for the
      process (those belonging to code in the binary itself and linked
      libraries) plus data pages (including stack and malloc()'d but
      not-yet-written-to memory segments).
    * Resident corresponds to a subset of the pages above: those pages
      which actually occupy physical/core memory. Notably, a page may
      appear in size but not in resident, e.g. a read-only text page
      from a library which has not been touched yet, or memory which
      has been malloc()'d but not yet written to.

My understanding of the values for the system as a whole (at the top of 'top') is as follows:

    * Active and inactive memory are essentially the same thing:
      resident memory belonging to processes. Being on the inactive
      rather than the active list simply means the pages in question
      were less recently used and are therefore more likely to be
      swapped out if the machine comes under memory pressure.
    * Wired is mostly kernel memory.
    * Cache is freed memory which the kernel has decided to keep in
      case it corresponds to a useful page in the future; it can be
      cheaply evicted onto the free list.
    * Free memory is not being used for anything.

It seems that pages which occur on the active + inactive lists must occur in the resident memory of one or more processes ("or more" since processes can share pages, e.g. read-only shared libraries or COW forked address space). Conversely, if a page *does not* occur in the resident memory of any process, it must not occupy any space on the active + inactive lists.

Therefore the active + inactive memory should always be less than or equal to the sum of the resident memory of all the processes on the system, right?

But it's not.
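As a sanity check on the numbers themselves, the system-wide queue sizes can also be read directly from sysctl rather than scraped from top; here is a minimal sketch (assuming the standard vm.stats.vm.* counters, which report page counts, and Python 2.7+):

    #!/usr/bin/env python
    # Minimal sketch: print the VM page queue sizes straight from sysctl.
    # Assumes FreeBSD's vm.stats.vm.* counters (values are page counts).
    import subprocess

    def sysctl(name):
        # 'sysctl -n' prints just the value
        return int(subprocess.check_output(["sysctl", "-n", name]).decode().strip())

    page_size = sysctl("vm.stats.vm.v_page_size")
    for q in ("v_active_count", "v_inactive_count", "v_wire_count",
              "v_cache_count", "v_free_count"):
        pages = sysctl("vm.stats.vm." + q)
        print("%-16s %6d MB" % (q, pages * page_size // (1024 * 1024)))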
So, I wrote a very simple Python script to add up the resident memory values in the output from 'top' and, on machine A:

    Mem: 3388M Active, 3209M Inact, 6066M Wired, 196K Cache, 11G Free

    There were 246 processes totalling 8271 MB resident memory

Whereas on machine B:

    Mem: 11G Active, 2598M Inact, 7177M Wired, 733M Cache, 1619M Free

    There were 441 processes totalling 9297 MB resident memory

Now, on machine A:

    3388M active + 3209M inactive - 8271M sum-of-resident = -1674M

I can attribute this negative value to shared libraries between the running processes (which the sum-of-resident is double-counting but active + inactive is not). But on machine B:

    11264M active + 2598M inactive - 9297M sum-of-resident = 4565M

I'm struggling to explain how, when there are only 9.2G (worst case, discounting shared pages) of resident processes, the system is using 11264M + 2598M = 13.5G of memory!

This "missing memory" is scary, because it seems to be increasing over time, and eventually, when the system runs out of free memory, I'm certain it will crash in the same way described in my previous thread [1].

Is my understanding of the virtual memory system badly broken - in which case please educate me ;-) - or is there a real problem here? If so, how can I dig deeper to help uncover/fix it?

Best Regards,
Luke Marsden

[1] lists.freebsd.org/pipermail/freebsd-fs/2012-February/013775.html
[2] https://gist.github.com/1988153

--
CTO, Hybrid Logic
+447791750420 | +1-415-449-1165 | www.hybrid-cluster.com
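P.S. For anyone who wants to reproduce these numbers, a rough sketch of this kind of summing (not the exact script I used - this version parses 'ps' rather than 'top' and assumes FreeBSD's ps reports rss in kilobytes):

    #!/usr/bin/env python
    # Rough sketch: sum the resident set sizes of every process and compare
    # against the Active + Inact page queues. Assumes FreeBSD ps (rss in KB)
    # and the vm.stats.vm.* sysctls.
    import subprocess

    def sysctl(name):
        return int(subprocess.check_output(["sysctl", "-n", name]).decode().strip())

    page_size = sysctl("vm.stats.vm.v_page_size")
    act_inact = (sysctl("vm.stats.vm.v_active_count") +
                 sysctl("vm.stats.vm.v_inactive_count")) * page_size

    # 'ps -axo rss=' prints one rss value (in KB) per process, no header.
    rss_kb = subprocess.check_output(["ps", "-axo", "rss="]).decode().split()
    sum_resident = sum(int(kb) for kb in rss_kb) * 1024

    mb = 1024 * 1024
    print("sum of resident: %6d MB" % (sum_resident // mb))
    print("active + inact:  %6d MB" % (act_inact // mb))
    print("difference:      %6d MB" % ((act_inact - sum_resident) // mb))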
Thanks for your email, Chuck.

> > Conversely, if a page *does not* occur in the resident
> > memory of any process, it must not occupy any space in the active +
> > inactive lists.
>
> Hmm... if a process gets swapped out entirely, the pages for it will be
> moved to the cache list, flushed, and then reused as soon as the disk I/O
> completes. But there is a window where the process can be marked as
> swapped out (and considered no longer resident) but still has some of its
> pages in physical memory.

There's no swapping happening on these machines (intentionally so, because as soon as we hit swap everything goes tits up), so this window doesn't concern me. I'm trying to confirm that, on a system with no pages swapped out, the following is a true statement:

    a page is accounted for in active + inactive if and only if it
    corresponds to one or more of the pages accounted for in the
    resident memory of the processes on the system (as per the
    output of 'top' and 'ps')

> > Therefore the active + inactive memory should always be less than or
> > equal to the sum of the resident memory of all the processes on the
> > system, right?
>
> No. If you've got a lot of process pages shared (i.e., a webserver with
> lots of httpd children, or a database pulling in a large common shmem
> area), then your process resident sizes can be very large compared to the
> system-wide active+inactive count.

But that's what I'm saying...

    sum(process resident sizes) >= active + inactive

Or, as I said it above, equivalently:

    active + inactive <= sum(process resident sizes)

The data I've got from this system, and what's killing us, shows the opposite:

    active + inactive > sum(process resident sizes)

- by nearly 5GB now and growing, which is what keeps causing these machines to crash. In particular:

    Mem: 13G Active, 1129M Inact, 7543M Wired, 120M Cache, 1553M Free

But the total sum of resident memories is 9457M (according to summing the output from ps or top):

    13G + 1129M = 14441M (active + inact) > 9457M (sum of res)

That's 4984M out, and that's almost enough to push us over the edge.

If my understanding of VM is correct, I don't see how this can happen. But it's happening, and it's causing real trouble here because our free memory keeps hitting zero and then we swap-spiral.

What can I do to investigate this discrepancy? Are there some tools I can use to debug the memory accounted as "active", to find out where it's going if not to resident process memory?

Thanks,
Luke

--
CTO, Hybrid Logic
+447791750420 | +1-415-449-1165 | www.hybrid-cluster.com
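In the meantime, one cheap way to at least quantify how fast the gap is growing is to leave something like this running (a minimal sketch, same assumptions as before: FreeBSD ps, the vm.stats.vm.* sysctls, Python 2.7+):

    #!/usr/bin/env python
    # Minimal sketch: log (active + inactive) - sum(resident) once a minute,
    # to see whether the unexplained gap really does grow over time.
    import subprocess, time

    def sysctl(name):
        return int(subprocess.check_output(["sysctl", "-n", name]).decode().strip())

    page_size = sysctl("vm.stats.vm.v_page_size")
    while True:
        act_inact = (sysctl("vm.stats.vm.v_active_count") +
                     sysctl("vm.stats.vm.v_inactive_count")) * page_size
        rss = sum(int(kb) for kb in
                  subprocess.check_output(["ps", "-axo", "rss="]).decode().split()) * 1024
        print("%s gap=%d MB" % (time.strftime("%Y-%m-%d %H:%M:%S"),
                                (act_inact - rss) // (1024 * 1024)))
        time.sleep(60)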
On Tue, 2012-03-06 at 19:13 +0000, Luke Marsden wrote:
> [...]
> Therefore the active + inactive memory should always be less than or
> equal to the sum of the resident memory of all the processes on the
> system, right?
>
> But it's not.
In my experience, the bulk of the memory in the inactive category is cached disk blocks, at least for ufs (I think zfs does things differently). On this desktop machine I have 12G physical and typically have roughly 11G inactive, and I can unmount one particular filesystem where most of my work is done and instantly I have almost no inactive and roughly 11G free.

-- Ian
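To see the same thing on your own machine, the before/after numbers are easy to capture; a minimal sketch (assumes the vm.stats.vm.* sysctls, root, and a mount point that is safe to unmount - the path below is just a placeholder):

    #!/usr/bin/env python
    # Minimal sketch: show how much Inact/Free change when a filesystem whose
    # file data is sitting in the page cache gets unmounted. MOUNT_POINT is a
    # placeholder - substitute a filesystem you can safely unmount.
    import subprocess

    def sysctl(name):
        return int(subprocess.check_output(["sysctl", "-n", name]).decode().strip())

    def queue_mb(counter):
        return sysctl(counter) * sysctl("vm.stats.vm.v_page_size") // (1024 * 1024)

    MOUNT_POINT = "/data/workfs"   # hypothetical path

    before = (queue_mb("vm.stats.vm.v_inactive_count"),
              queue_mb("vm.stats.vm.v_free_count"))
    subprocess.check_call(["umount", MOUNT_POINT])
    after = (queue_mb("vm.stats.vm.v_inactive_count"),
             queue_mb("vm.stats.vm.v_free_count"))

    print("inactive: %d MB -> %d MB" % (before[0], after[0]))
    print("free:     %d MB -> %d MB" % (before[1], after[1]))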
On Wed, 07 Mar 2012 10:23:38 +0200, Konstantin Belousov wrote:
> On Wed, Mar 07, 2012 at 12:36:21AM +0000, Luke Marsden wrote:
> ...
>> I'm trying to confirm that, on a system with no pages swapped out,
>> the following is a true statement:
>>
>>     a page is accounted for in active + inactive if and only if it
>>     corresponds to one or more of the pages accounted for in the
>>     resident memory of the processes on the system (as per the
>>     output of 'top' and 'ps')
>
> No.
>
> The pages belonging to a vnode vm object can be active or inactive or
> cached but not mapped into any process address space.

I wonder if some of the ideas from Denys Vlasenko in this thread

    http://comments.gmane.org/gmane.linux.redhat.fedora.devel/157706

would be useful?

...
"Today, I'm looking at my process list, sorted by amount of dirtied pages (which very closely matches the amount of malloced and used space - that is, malloced but not-written-to memory areas are not included). This is the most expensive type of page: it can't be discarded. If we were in a memory squeeze, the kernel would have to swap these pages out, if swap exists; otherwise the kernel can't do anything at all."
...
"Note that any shared pages (such as glibc) are not freed this way; also, non-mapped pages (such as large but unused malloced space, or large but unused file mappings) also do not contribute to the MemFree increase."

jb
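A quick way to convince oneself of Konstantin's point: pull a large file through plain read()/write() on a UFS filesystem and watch active + inactive grow while the reading process's own resident size barely moves (on ZFS the data would mostly land in the ARC, i.e. Wired, instead). A minimal sketch, assuming a UFS-backed path with a few hundred MB free and the vm.stats.vm.* sysctls; the exact numbers will depend on how quickly dirty buffers get flushed:

    #!/usr/bin/env python
    # Minimal sketch: file data touched only via read()/write() ends up in the
    # Active/Inact queues without being resident in any process. PATH is a
    # placeholder for a location on a UFS filesystem with ~300 MB free.
    import os, subprocess

    def sysctl(name):
        return int(subprocess.check_output(["sysctl", "-n", name]).decode().strip())

    def act_inact_mb():
        pages = (sysctl("vm.stats.vm.v_active_count") +
                 sysctl("vm.stats.vm.v_inactive_count"))
        return pages * sysctl("vm.stats.vm.v_page_size") // (1024 * 1024)

    def my_rss_mb():
        out = subprocess.check_output(["ps", "-o", "rss=", "-p", str(os.getpid())])
        return int(out.decode().strip()) // 1024

    PATH = "/ufs-tmp/bigfile"      # hypothetical path on a UFS filesystem
    SIZE_MB = 256

    print("before: act+inact=%d MB, my rss=%d MB" % (act_inact_mb(), my_rss_mb()))
    with open(PATH, "wb") as f:                    # create a large file...
        for _ in range(SIZE_MB):
            f.write(b"\0" * (1024 * 1024))
    with open(PATH, "rb") as f:                    # ...and read it back
        while f.read(1 << 20):
            pass
    print("after:  act+inact=%d MB, my rss=%d MB" % (act_inact_mb(), my_rss_mb()))
    os.unlink(PATH)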