Jürgen Keil
2006-Mar-22 10:25 UTC
[zfs-discuss] zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399298
Recently this kernel:zfs bug was filed:

Bug: 6399298
Synopsis: system seems to be spinning in dnlc_reduce_cache()
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6399298

... with a long analysis of a kernel crash dump. It seems the root of the
problem is that the DNLC's nc_hash chains are all empty, but dnlc_nentries == 1,
and dnlc_reduce_cache() is called trying to reduce the dnlc cache down to
0 entries. This gets stuck in an endless loop inside dnlc_reduce_cache().

As far as I understand it, the explanation is quite simple (and isn't visible
on bugs.opensolaris.org):

----------------------------------------------------------------------

dnlc_get():

- allocates memory for a new dnlc entry and increments dnlc_nentries by 1,
  but does not yet add the new dnlc entry to the hash table
  (that happens at a later time, by calling dnlc_enter())

  (or in other words: dnlc_nentries is the number of *allocated* dnlc entries,
  not the number of entries inserted into the nc_hash table)

- assuming the dnlc was completely empty and dnlc_nentries == 0 when we
  called dnlc_get(): on return, dnlc_nentries == 1 and all nc_hash[] chains
  are empty

dnlc_reduce_cache():

- when dnlc_nentries == 1 and all nc_hash[] chains are empty:

  The following loop is an endless loop (dnlc_nentries == 1, low_water == 0):

	/*
	 * Find the first non empty hash queue without locking
	 * Recheck we really have entries to avoid
	 * an infinite loop if all the entries get purged.
	 */
	do {
		if (++hp == &nc_hash[nc_hashsz]) {
			hp = nc_hash;
			if (dnlc_nentries <= low_water) {
				dnlc_reduce_idle = 1;
				return;
			}
		}
	} while (hp->hash_next == (ncache_t *)hp);

Suggested fix:
==============

Scan the nc_hash[] queues, and if all are empty, return from dnlc_reduce_cache():

--- ../opensolaris-20060313/usr/src/uts/common/fs/dnlc.c	2006-03-14 21:05:00.000000000 +0100
+++ usr/src/uts/common/fs/dnlc.c	2006-03-21 22:43:42.350917077 +0100
@@ -1009,7 +1009,7 @@
 void
 dnlc_reduce_cache(void *reduce_percent)
 {
-	nc_hash_t *hp = dnlc_free_rotor;
+	nc_hash_t *end, *hp = dnlc_free_rotor;
 	vnode_t *vp;
 	ncache_t *ncp;
 	int cnt;
@@ -1032,15 +1032,15 @@
 	 * Recheck we really have entries to avoid
 	 * an infinite loop if all the entries get purged.
 	 */
+	end = hp;
 	do {
-		if (++hp == &nc_hash[nc_hashsz]) {
+		if (++hp == &nc_hash[nc_hashsz])
 			hp = nc_hash;
-			if (dnlc_nentries <= low_water) {
-				dnlc_reduce_idle = 1;
-				return;
-			}
-		}
-	} while (hp->hash_next == (ncache_t *)hp);
+	} while (hp->hash_next == (ncache_t *)hp && hp != end);
+	if (hp->hash_next == (ncache_t *)hp) {
+		dnlc_reduce_idle = 1;
+		return;
+	}

 	mutex_enter(&hp->hash_lock);
 	for (cnt = 0, ncp = hp->hash_prev; ncp != (ncache_t *)hp;

This message posted from opensolaris.org
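For readability, this is roughly how the search loop would read with the
suggested patch applied (reconstructed from the diff above; indentation and
surrounding context are approximate, not verified against the actual tree):

	/*
	 * Find the first non empty hash queue without locking.
	 * Visit every queue at most once; if all of them turn out to
	 * be empty, stop instead of spinning.
	 */
	end = hp;
	do {
		if (++hp == &nc_hash[nc_hashsz])
			hp = nc_hash;
	} while (hp->hash_next == (ncache_t *)hp && hp != end);
	if (hp->hash_next == (ncache_t *)hp) {
		/* wrapped all the way around: every queue is empty */
		dnlc_reduce_idle = 1;
		return;
	}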
eric kustarz
2006-Mar-22 17:41 UTC
[zfs-discuss] zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399298
Jürgen Keil wrote:

>Recently this kernel:zfs bug was filed:
>
>Bug: 6399298
>Synopsis: system seems to be spinning in dnlc_reduce_cache()
>http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6399298
>
>... with a long analysis of a kernel crash dump. It seems the root of the
>problem is that the DNLC's nc_hash chains are all empty, but
>dnlc_nentries == 1, and dnlc_reduce_cache() is called trying to reduce the
>dnlc cache down to 0 entries. This gets stuck in an endless loop inside
>dnlc_reduce_cache().

Nice analysis... soon we'll have to put you on the payroll! I have changes
to address this bug going back this week... comments below...

>As far as I understand it, the explanation is quite simple (and isn't
>visible on bugs.opensolaris.org):
>
>----------------------------------------------------------------------
>
>dnlc_get():
>
>- allocates memory for a new dnlc entry and increments dnlc_nentries by 1,
>  but does not yet add the new dnlc entry to the hash table
>  (that happens at a later time, by calling dnlc_enter())
>
>  (or in other words: dnlc_nentries is the number of *allocated* dnlc
>  entries, not the number of entries inserted into the nc_hash table)

Almost - it really is the number of inserted entries - it's just that the
incrementing/decrementing of dnlc_nentries and the insertion/deletion into a
hash queue don't happen at exactly the same time (say, under a lock).
You're right that dnlc_get() will bump dnlc_nentries and *not* insert the
entry - however, the two callers (dnlc_enter() and dnlc_update()) either
nc_inshash() or dnlc_free() the new entry. nc_inshash() (as you can guess)
puts it in a hash queue - dnlc_free() decrements dnlc_nentries.

>- assuming the dnlc was completely empty and dnlc_nentries == 0 when we
>  called dnlc_get(): on return, dnlc_nentries == 1 and all nc_hash[] chains
>  are empty

It's possible for that state to *temporarily* exist like that... or it's
also possible for the *temporary* state to exist where we call
dnlc_remove(), remove the entry from the hash queue, and call dnlc_free(),
which then gets interrupted before decrementing dnlc_nentries. The latter
was what was actually happening. Note it's a definite bug for dnlc_nentries
to be != 0 and the hash queues to be empty after dnlc_update(),
dnlc_enter(), dnlc_remove(), etc. finish.

>dnlc_reduce_cache():
>
>- when dnlc_nentries == 1 and all nc_hash[] chains are empty:
>
>  The following loop is an endless loop (dnlc_nentries == 1, low_water == 0):
>
>	/*
>	 * Find the first non empty hash queue without locking
>	 * Recheck we really have entries to avoid
>	 * an infinite loop if all the entries get purged.
>	 */
>	do {
>		if (++hp == &nc_hash[nc_hashsz]) {
>			hp = nc_hash;
>			if (dnlc_nentries <= low_water) {
>				dnlc_reduce_idle = 1;
>				return;
>			}
>		}
>	} while (hp->hash_next == (ncache_t *)hp);
>
>Suggested fix:
>==============
>
>Scan the nc_hash[] queues, and if all are empty, return from dnlc_reduce_cache():
>
>--- ../opensolaris-20060313/usr/src/uts/common/fs/dnlc.c	2006-03-14 21:05:00.000000000 +0100
>+++ usr/src/uts/common/fs/dnlc.c	2006-03-21 22:43:42.350917077 +0100
>@@ -1009,7 +1009,7 @@
> void
> dnlc_reduce_cache(void *reduce_percent)
> {
>-	nc_hash_t *hp = dnlc_free_rotor;
>+	nc_hash_t *end, *hp = dnlc_free_rotor;
> 	vnode_t *vp;
> 	ncache_t *ncp;
> 	int cnt;
>@@ -1032,15 +1032,15 @@
> 	 * Recheck we really have entries to avoid
> 	 * an infinite loop if all the entries get purged.
> 	 */
>+	end = hp;
> 	do {
>-		if (++hp == &nc_hash[nc_hashsz]) {
>+		if (++hp == &nc_hash[nc_hashsz])
> 			hp = nc_hash;
>-			if (dnlc_nentries <= low_water) {
>-				dnlc_reduce_idle = 1;
>-				return;
>-			}
>-		}
>-	} while (hp->hash_next == (ncache_t *)hp);
>+	} while (hp->hash_next == (ncache_t *)hp && hp != end);
>+	if (hp->hash_next == (ncache_t *)hp) {
>+		dnlc_reduce_idle = 1;
>+		return;
>+	}
>
> 	mutex_enter(&hp->hash_lock);
> 	for (cnt = 0, ncp = hp->hash_prev; ncp != (ncache_t *)hp;

This actually is part of the real fix that I have. Algorithmically, it
doesn't make sense to keep looking endlessly for a non-empty hash queue -
it's just not critical. We should look at each hash queue once and only
once.

The deeper problem that Ed noticed was that dnlc_nentries wasn't qualified
as volatile. So the compiler was optimizing the
"if (dnlc_nentries <= low_water)" check before/outside of the do/while loop -
so even if the interrupted thread finished and decremented dnlc_nentries
back to 0, this "if" check would never be executed and we'd still loop
forever.

The other part of the fix is to never try to reduce that low - that was
something new I added. The normal way into dnlc_reduce_cache() via
dnlc_get() will try to reduce to 99% of ncsize - that won't change.
However, I'm changing the new path (directly calling dnlc_reduce_cache())
so that it will try to reduce down to at most 3% of ncsize.

keep up the good analysis,
eric

>This message posted from opensolaris.org
>_______________________________________________
>zfs-discuss mailing list
>zfs-discuss at opensolaris.org
>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
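A minimal standalone sketch of the "missing volatile" problem eric describes
above - all names here are made up for illustration, this is not the dnlc.c
code. An interrupt handler is assumed to decrement the counter while another
thread is spinning on it:

	/*
	 * Hypothetical globals: one plain, one volatile-qualified.
	 */
	static int nentries;             /* compiler may cache this in a register */
	static volatile int nentries_v;  /* re-loaded from memory on every check   */

	void
	drain_wait(void)
	{
		/*
		 * BROKEN: nothing in the loop body writes 'nentries', so an
		 * optimizing compiler may load it once and hoist the test
		 * out of the loop; a decrement performed by an interrupt
		 * handler is never observed and the loop spins forever.
		 */
		while (nentries > 0)
			continue;

		/*
		 * OK: 'volatile' forces a fresh load of 'nentries_v' on
		 * every iteration, so the interrupt handler's decrement is
		 * eventually seen and the loop exits.
		 */
		while (nentries_v > 0)
			continue;
	}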
Al Hopper
2006-Mar-22 19:47 UTC
[zfs-discuss] zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399298
On Wed, 22 Mar 2006, eric kustarz wrote:

.... snip - lots of Good Stuff deleted ...

> The other part of the fix is to never try to reduce that low - that was
> something new I added. The normal way into dnlc_reduce_cache() via
> dnlc_get() will try to reduce to 99% of ncsize - that won't change.
> However, I'm changing the new path (directly calling dnlc_reduce_cache())
> so that it will try to reduce down to at most 3% of ncsize.

Where does the 3% number come from? (I'm curious.)

Is there any attempt to preserve specific dnlc entries - possibly by using
an LRU-type algorithm? Or entries that have 'matched' in the past? Would
the application of such an algorithm make sense?

The use case I have in mind is something that a developer does many times a
day. The generic form looks like:

	find . -name '*.java' | xargs grep -i whereIsIt

This is a very expensive operation if the dnlc caches have been emptied.
And it's not unusual to also 'find /usr -type f ....'.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
eric kustarz
2006-Mar-22 20:02 UTC
[zfs-discuss] zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399298
Al Hopper wrote:

>On Wed, 22 Mar 2006, eric kustarz wrote:
>
>.... snip - lots of Good Stuff deleted ...
>
>>The other part of the fix is to never try to reduce that low - that was
>>something new I added. The normal way into dnlc_reduce_cache() via
>>dnlc_get() will try to reduce to 99% of ncsize - that won't change.
>>However, I'm changing the new path (directly calling dnlc_reduce_cache())
>>so that it will try to reduce down to at most 3% of ncsize.
>
>Where does the 3% number come from? (I'm curious.)
>Is there any attempt to preserve specific dnlc entries - possibly by using
>an LRU-type algorithm? Or entries that have 'matched' in the past?
>Would the application of such an algorithm make sense?
>
>The use case I have in mind is something that a developer does many times a
>day. The generic form looks like:
>find . -name '*.java' | xargs grep -i whereIsIt
>
>This is a very expensive operation if the dnlc caches have been emptied.
>And it's not unusual to also 'find /usr -type f ....'.

We're deciding on what the actual value will be - 3% or something else.

Note, reducing only down to 3% (or whatever it will be) *only* happens when
ZFS detects very low memory - as dictated by arc_reclaim_needed():

	if (freemem < lotsfree + needfree + extra)

or

	if (availrmem < swapfs_minfree + swapfs_reserve + extra)

or

#if defined(__i386)
	if (btop(vmem_size(heap_arena, VMEM_FREE)) <
	    (btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
		return (1);
#endif

Mark is currently refining the last case.

Your "find" case should not be affected by this code path.

The dnlc is sized by number of entries, as opposed to actual bytes of memory
used (as ZFS's ARC is). This is because the dnlc just VN_HOLDs a vnode, and
it is file-system specific how many bytes that hold actually pins. As it
turns out, it's more expensive to hold a znode than an inode, so we have
code in zfs to free up memory when low memory is detected (part of that is
releasing dnlc entries). And it wouldn't necessarily be a bad thing if ufs
or nfs did a similar thing.

Entries are kicked out via dnlc_reduce_cache() - it has rules to kick out
entries with a v_count of 1, negative cache entries, and entries on long
chains first. If those don't suffice, it picks off the end of the chain.
And when entries are added (via nc_inshash()), they are placed at the front
of the chain.

eric
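A rough sketch of the eviction priority just described - purely
illustrative, with made-up types and helper names rather than the actual
dnlc.c code, which applies these rules inline while walking each chain from
the tail (oldest) end:

	/*
	 * Illustrative victim-selection policy (hypothetical names).
	 * New entries go to the head of a chain, so the tail is the
	 * fallback victim when none of the preferred rules match.
	 */
	typedef struct entry {
		struct entry	*next, *prev;
		int		vnode_refcount;	/* v_count of the held vnode */
		int		is_negative;	/* cached "name not found"  */
	} entry_t;

	static int
	preferred_victim(const entry_t *e, int chain_len, int long_chain)
	{
		if (e->vnode_refcount == 1)	/* only the dnlc holds the vnode */
			return (1);
		if (e->is_negative)		/* negative cache entry */
			return (1);
		if (chain_len > long_chain)	/* unusually long hash chain */
			return (1);
		return (0);			/* otherwise evict from the chain tail */
	}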
Jürgen Keil
2006-Mar-23 11:15 UTC
[zfs-discuss] Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399
> Note, reducing only down to 3% (or whatever it will be) *only* happens
> when ZFS detects very low memory - as dictated by arc_reclaim_needed():

....

> #if defined(__i386)
> 	if (btop(vmem_size(heap_arena, VMEM_FREE)) <
> 	    (btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
> 		return (1);
> #endif
>
> Mark is currently refining the last case.

The problem is that on 32-bit x86 we can enter this ">75% heap_arena full"
condition, and zfs's arc cache basically prevents the kernel's kmem_cache
allocators from returning most of the unused free space to the heap arena,
so that we remain at >75% heap full all the time.

A side effect is that zfs tries to shrink its arc buffer cache more and
more, and at the same time reduces the dnlc cache by 3% over and over again.
In the end zfs is running with almost no buffer cache, the dnlc contains
only a couple of entries (<50), and the system is reading from the disk
like crazy.

This is bug 6397610 / 6398177. See also this thread:
http://www.opensolaris.org/jive/thread.jspa?threadID=6882&tstart=0

> Your "find" case should not be affected by this code path.

I'm not so sure here....

The best way (that I've found so far) to reproduce bugs 6397610 / 6398177
is a recursive diff over two sets of opensolaris sources

	gdiff -ru opensolaris-2006XXYY/usr opensolaris-2006XXZZ/usr

on a 32-bit x86 machine, followed by starting a "nightly -i opensolaris.sh"
incremental build. The recursive diff should be *very* similar to the
"find . -name '*.java' | xargs grep -i whereIsIt" case.

The diff adds lots of entries to the kernel's dnode_t, dmu_buf_impl_t and
zfs_znode_cache kmem_caches. Then the nightly build needs more space in
other kmem areas, the ">75% heap full" condition is triggered, and some
zio_bufs and arc_buf* are returned to the heap arena. As a side effect,
dnode_t, dmu_buf_impl_t and zfs_znode_cache entries are freed (entries are
intended to be reused in the same kmem_cache), but this free space is *not*
given back to the heap arena.

There are cases where zfs thinks the kernel's heap is full, but the
kmem_caches are sitting on > 60 mbytes of free memory.

This message posted from opensolaris.org
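One way to observe this kind of situation (assuming kernel debugger access
on the affected machine) is to look at the per-cache statistics, which show
for each kmem cache how much memory the cache holds versus how many buffers
are actually in use:

	# echo ::kmastat | mdb -k

The dnode_t, dmu_buf_impl_t and zfs_znode_cache lines make it visible when
those caches are sitting on tens of megabytes of freed-but-not-returned
memory while the heap arena looks full.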
Jonathan Adams
2006-Mar-23 16:31 UTC
[zfs-discuss] Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399
On Thu, Mar 23, 2006 at 03:15:06AM -0800, Jürgen Keil wrote:

> The diff adds lots of entries to the kernel's dnode_t, dmu_buf_impl_t
> and zfs_znode_cache kmem_caches.
>
> Then the nightly build needs more space in other kmem areas, the
> ">75% heap full" condition is triggered, and some zio_bufs and arc_buf*
> are returned to the heap arena. As a side effect, dnode_t, dmu_buf_impl_t
> and zfs_znode_cache entries are freed (entries are intended to be reused
> in the same kmem_cache), but this free space is *not* given back to the
> heap arena.

The problem is that there are 9 dmu_buf_impl_ts per "slab", and 17 dnodes
per "slab"; a slab cannot be freed until *all* of its objects are free, and
not cached in magazines. I've heard this referred to as a slab being "held
hostage" by an outstanding allocation.

kmem_reap() flushes the magazines to the slabs (except for a 15-second
working set). Usually, this frees up memory, but there aren't any
guarantees.

It might make sense to invent a "desperate" reaping mode, which kicks in
only when the kmem_reap fails to free up any memory. But this is still not
guaranteed to free up more memory, so it might just slow things down more
when we're already under heavy load.

Cheers,
- jonathan

> There are cases where zfs thinks the kernel's heap is full, but the
> kmem_caches are sitting on > 60 mbytes of free memory.
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Jonathan Adams, Solaris Kernel Development
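A back-of-the-envelope illustration of the "held hostage" effect described
above (a toy model, not kernel code; only the 17-objects-per-slab figure is
taken from the post): if live objects are scattered uniformly across slabs,
a slab is reclaimable only when all of its objects are free, so even with
90% of the objects free, most slabs cannot be returned to the heap arena.

	#include <math.h>
	#include <stdio.h>

	int
	main(void)
	{
		const int objs_per_slab = 17;	/* e.g. dnode_t slabs, per the post */
		const double frac_free = 0.90;	/* 90% of objects are free */

		/* probability that every object in a given slab is free */
		double reclaimable = pow(frac_free, objs_per_slab);

		/* prints roughly 17%: ~83% of slabs stay pinned */
		printf("expected reclaimable slabs: %.1f%%\n", reclaimable * 100.0);
		return (0);
	}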