Jürgen Keil
2006-Mar-22 10:25 UTC
[zfs-discuss] zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399298
Recently this kernel:zfs bug was filed:

Bug: 6399298
Synopsis: system seems to be spinning in dnlc_reduce_cache()
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6399298

... with a long analysis of a kernel crash dump. It seems the root of the
problem is that the DNLC's nc_hash chains are all empty, but dnlc_nentries == 1,
and dnlc_reduce_cache() is called trying to reduce the dnlc cache down to
0 entries. This gets stuck in an endless loop inside dnlc_reduce_cache().

As far as I understand it, the explanation is quite simple (and isn't visible
on bugs.opensolaris.org):

----------------------------------------------------------------------

dnlc_get():

- allocates memory for a new dnlc entry and increments dnlc_nentries by 1,
  but does not yet add the new dnlc entry to the hash table
  (that happens at a later time, by calling dnlc_enter())

  (or in other words: dnlc_nentries is the number of *allocated* dnlc entries,
  not the number of entries inserted into the nc_hash table)

- assuming the dnlc was completely empty and dnlc_nentries == 0 when we
  called dnlc_get(): on return, dnlc_nentries == 1 and all nc_hash[] chains
  are empty

dnlc_reduce_cache():

- when dnlc_nentries == 1 and all nc_hash[] chains are empty:

  The following loop is an endless loop (dnlc_nentries == 1, low_water == 0):

	/*
	 * Find the first non empty hash queue without locking
	 * Recheck we really have entries to avoid
	 * an infinite loop if all the entries get purged.
	 */
	do {
		if (++hp == &nc_hash[nc_hashsz]) {
			hp = nc_hash;
			if (dnlc_nentries <= low_water) {
				dnlc_reduce_idle = 1;
				return;
			}
		}
	} while (hp->hash_next == (ncache_t *)hp);

Suggested fix:
==============

Scan the nc_hash[] queues, and if all are empty, return from dnlc_reduce_cache():

--- ../opensolaris-20060313/usr/src/uts/common/fs/dnlc.c	2006-03-14 21:05:00.000000000 +0100
+++ usr/src/uts/common/fs/dnlc.c	2006-03-21 22:43:42.350917077 +0100
@@ -1009,7 +1009,7 @@
 void
 dnlc_reduce_cache(void *reduce_percent)
 {
-	nc_hash_t *hp = dnlc_free_rotor;
+	nc_hash_t *end, *hp = dnlc_free_rotor;
 	vnode_t *vp;
 	ncache_t *ncp;
 	int cnt;
@@ -1032,15 +1032,15 @@
 	 * Recheck we really have entries to avoid
 	 * an infinite loop if all the entries get purged.
 	 */
+	end = hp;
 	do {
-		if (++hp == &nc_hash[nc_hashsz]) {
+		if (++hp == &nc_hash[nc_hashsz])
 			hp = nc_hash;
-			if (dnlc_nentries <= low_water) {
-				dnlc_reduce_idle = 1;
-				return;
-			}
-		}
-	} while (hp->hash_next == (ncache_t *)hp);
+	} while (hp->hash_next == (ncache_t *)hp && hp != end);
+	if (hp->hash_next == (ncache_t *)hp) {
+		dnlc_reduce_idle = 1;
+		return;
+	}

 	mutex_enter(&hp->hash_lock);
 	for (cnt = 0, ncp = hp->hash_prev; ncp != (ncache_t *)hp;

This message posted from opensolaris.org
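For readability, this is roughly how the search loop would read with the
suggested patch applied (reconstructed from the diff above; indentation and
surrounding context are approximate, not verified against the actual tree):

	/*
	 * Find the first non empty hash queue without locking.
	 * Visit every queue at most once; if all of them turn out to
	 * be empty, stop instead of spinning.
	 */
	end = hp;
	do {
		if (++hp == &nc_hash[nc_hashsz])
			hp = nc_hash;
	} while (hp->hash_next == (ncache_t *)hp && hp != end);
	if (hp->hash_next == (ncache_t *)hp) {
		/* wrapped all the way around: every queue is empty */
		dnlc_reduce_idle = 1;
		return;
	}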
eric kustarz
2006-Mar-22 17:41 UTC
[zfs-discuss] zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399298
Jürgen Keil wrote:

>Recently this kernel:zfs bug was filed:
>
>Bug: 6399298
>Synopsis: system seems to be spinning in dnlc_reduce_cache()
>http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6399298
>
>... with a long analysis of a kernel crash dump. It seems the root of the
>problem is that the DNLC's nc_hash chains are all empty, but
>dnlc_nentries == 1, and dnlc_reduce_cache() is called trying to reduce the
>dnlc cache down to 0 entries. This gets stuck in an endless loop inside
>dnlc_reduce_cache().

Nice analysis... soon we'll have to put you on the payroll! I have changes
to address this bug going back this week... comments below...

>As far as I understand it, the explanation is quite simple (and isn't
>visible on bugs.opensolaris.org):
>
>----------------------------------------------------------------------
>
>dnlc_get():
>
>- allocates memory for a new dnlc entry and increments dnlc_nentries by 1,
>  but does not yet add the new dnlc entry to the hash table
>  (that happens at a later time, by calling dnlc_enter())
>
>  (or in other words: dnlc_nentries is the number of *allocated* dnlc
>  entries, not the number of entries inserted into the nc_hash table)

Almost - it really is the number of inserted entries - it's just that the
incrementing/decrementing of dnlc_nentries and the insertion/deletion into a
hash queue don't happen at exactly the same time (say, under a lock).
You're right that dnlc_get() will bump dnlc_nentries and *not* insert the
entry - however, the two callers (dnlc_enter() and dnlc_update()) either
nc_inshash() or dnlc_free() the new entry. nc_inshash() (as you can guess)
puts it in a hash queue - dnlc_free() decrements dnlc_nentries.

>- assuming the dnlc was completely empty and dnlc_nentries == 0 when we
>  called dnlc_get(): on return, dnlc_nentries == 1 and all nc_hash[] chains
>  are empty

It's possible for that state to *temporarily* exist like that... or it's
also possible for the *temporary* state to exist where we call
dnlc_remove(), remove the entry from the hash queue, and call dnlc_free(),
which then gets interrupted before decrementing dnlc_nentries. The latter
was what was actually happening. Note it's a definite bug for dnlc_nentries
to be != 0 and the hash queues to be empty after dnlc_update(),
dnlc_enter(), dnlc_remove(), etc. finish.

>dnlc_reduce_cache():
>
>- when dnlc_nentries == 1 and all nc_hash[] chains are empty:
>
>  The following loop is an endless loop (dnlc_nentries == 1, low_water == 0):
>
>	/*
>	 * Find the first non empty hash queue without locking
>	 * Recheck we really have entries to avoid
>	 * an infinite loop if all the entries get purged.
>	 */
>	do {
>		if (++hp == &nc_hash[nc_hashsz]) {
>			hp = nc_hash;
>			if (dnlc_nentries <= low_water) {
>				dnlc_reduce_idle = 1;
>				return;
>			}
>		}
>	} while (hp->hash_next == (ncache_t *)hp);
>
>Suggested fix:
>==============
>
>Scan the nc_hash[] queues, and if all are empty, return from dnlc_reduce_cache():
>
>--- ../opensolaris-20060313/usr/src/uts/common/fs/dnlc.c	2006-03-14 21:05:00.000000000 +0100
>+++ usr/src/uts/common/fs/dnlc.c	2006-03-21 22:43:42.350917077 +0100
>@@ -1009,7 +1009,7 @@
> void
> dnlc_reduce_cache(void *reduce_percent)
> {
>-	nc_hash_t *hp = dnlc_free_rotor;
>+	nc_hash_t *end, *hp = dnlc_free_rotor;
> 	vnode_t *vp;
> 	ncache_t *ncp;
> 	int cnt;
>@@ -1032,15 +1032,15 @@
> 	 * Recheck we really have entries to avoid
> 	 * an infinite loop if all the entries get purged.
> 	 */
>+	end = hp;
> 	do {
>-		if (++hp == &nc_hash[nc_hashsz]) {
>+		if (++hp == &nc_hash[nc_hashsz])
> 			hp = nc_hash;
>-			if (dnlc_nentries <= low_water) {
>-				dnlc_reduce_idle = 1;
>-				return;
>-			}
>-		}
>-	} while (hp->hash_next == (ncache_t *)hp);
>+	} while (hp->hash_next == (ncache_t *)hp && hp != end);
>+	if (hp->hash_next == (ncache_t *)hp) {
>+		dnlc_reduce_idle = 1;
>+		return;
>+	}
>
> 	mutex_enter(&hp->hash_lock);
> 	for (cnt = 0, ncp = hp->hash_prev; ncp != (ncache_t *)hp;

This actually is part of the real fix that I have. Algorithmically, it
doesn't make sense to keep looking endlessly for a non-empty hash queue -
it's just not critical. We should look at each hash queue once and only
once.

The deeper problem that Ed noticed was that dnlc_nentries wasn't qualified
as volatile. So the compiler was optimizing the
"if (dnlc_nentries <= low_water)" check before/outside of the do/while loop -
so even if the interrupted thread finished and decremented dnlc_nentries
back to 0, this "if" check would never be executed and we'd still loop
forever.

The other part of the fix is to never try to reduce that low - that was
something new I added. The normal way into dnlc_reduce_cache() via
dnlc_get() will try to reduce to 99% of ncsize - that won't change.
However, I'm changing the new path (directly calling dnlc_reduce_cache())
so that it will try to reduce down to at most 3% of ncsize.

keep up the good analysis,
eric

>This message posted from opensolaris.org
>_______________________________________________
>zfs-discuss mailing list
>zfs-discuss at opensolaris.org
>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
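A minimal standalone sketch of the "missing volatile" problem eric describes
above - all names here are made up for illustration, this is not the dnlc.c
code. An interrupt handler is assumed to decrement the counter while another
thread is spinning on it:

	/*
	 * Hypothetical globals: one plain, one volatile-qualified.
	 */
	static int nentries;             /* compiler may cache this in a register */
	static volatile int nentries_v;  /* re-loaded from memory on every check   */

	void
	drain_wait(void)
	{
		/*
		 * BROKEN: nothing in the loop body writes 'nentries', so an
		 * optimizing compiler may load it once and hoist the test
		 * out of the loop; a decrement performed by an interrupt
		 * handler is never observed and the loop spins forever.
		 */
		while (nentries > 0)
			continue;

		/*
		 * OK: 'volatile' forces a fresh load of 'nentries_v' on
		 * every iteration, so the interrupt handler's decrement is
		 * eventually seen and the loop exits.
		 */
		while (nentries_v > 0)
			continue;
	}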
Al Hopper
2006-Mar-22 19:47 UTC
[zfs-discuss] zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399298
On Wed, 22 Mar 2006, eric kustarz wrote:

.... snip - lots of Good Stuff deleted ...

> The other part of the fix is to never try to reduce that low - that was
> something new I added. The normal way into dnlc_reduce_cache() via
> dnlc_get() will try to reduce to 99% of ncsize - that won't change.
> However, I'm changing the new path (directly calling dnlc_reduce_cache())
> so that it will try to reduce down to at most 3% of ncsize.

Where does the 3% number come from? (I'm curious.)

Is there any attempt to preserve specific dnlc entries - possibly by using
an LRU-type algorithm? Or entries that have 'matched' in the past? Would
the application of such an algorithm make sense?

The use case I have in mind is something that a developer does many times a
day. The generic form looks like:

	find . -name '*.java' | xargs grep -i whereIsIt

This is a very expensive operation if the dnlc caches have been emptied.
And it's not unusual to also 'find /usr -type f ....'.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
eric kustarz
2006-Mar-22 20:02 UTC
[zfs-discuss] zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399298
Al Hopper wrote:

>On Wed, 22 Mar 2006, eric kustarz wrote:
>
>.... snip - lots of Good Stuff deleted ...
>
>>The other part of the fix is to never try to reduce that low - that was
>>something new I added. The normal way into dnlc_reduce_cache() via
>>dnlc_get() will try to reduce to 99% of ncsize - that won't change.
>>However, I'm changing the new path (directly calling dnlc_reduce_cache())
>>so that it will try to reduce down to at most 3% of ncsize.
>
>Where does the 3% number come from? (I'm curious.)
>Is there any attempt to preserve specific dnlc entries - possibly by using
>an LRU-type algorithm? Or entries that have 'matched' in the past?
>Would the application of such an algorithm make sense?
>
>The use case I have in mind is something that a developer does many times a
>day. The generic form looks like:
>find . -name '*.java' | xargs grep -i whereIsIt
>
>This is a very expensive operation if the dnlc caches have been emptied.
>And it's not unusual to also 'find /usr -type f ....'.

We're deciding on what the actual value will be - 3% or something else.

Note, reducing only down to 3% (or whatever it will be) *only* happens when
ZFS detects very low memory - as dictated by arc_reclaim_needed():

	if (freemem < lotsfree + needfree + extra)

or

	if (availrmem < swapfs_minfree + swapfs_reserve + extra)

or

#if defined(__i386)
	if (btop(vmem_size(heap_arena, VMEM_FREE)) <
	    (btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
		return (1);
#endif

Mark is currently refining the last case.

Your "find" case should not be affected by this code path.

The dnlc is sized by number of entries, as opposed to actual bytes of memory
used (as ZFS's ARC is). This is because the dnlc just VN_HOLDs a vnode, and
it is file-system specific how many bytes that hold actually pins. As it
turns out, it's more expensive to hold a znode than an inode, so we have
code in zfs to free up memory when low memory is detected (part of that is
releasing dnlc entries). And it wouldn't necessarily be a bad thing if ufs
or nfs did a similar thing.

Entries are kicked out via dnlc_reduce_cache() - it has rules to kick out
entries with a v_count of 1, negative cache entries, and entries on long
chains first. If those don't suffice, it picks off the end of the chain.
And when entries are added (via nc_inshash()), they are placed at the front
of the chain.

eric
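A rough sketch of the eviction priority just described - purely
illustrative, with made-up types and helper names rather than the actual
dnlc.c code, which applies these rules inline while walking each chain from
the tail (oldest) end:

	/*
	 * Illustrative victim-selection policy (hypothetical names).
	 * New entries go to the head of a chain, so the tail is the
	 * fallback victim when none of the preferred rules match.
	 */
	typedef struct entry {
		struct entry	*next, *prev;
		int		vnode_refcount;	/* v_count of the held vnode */
		int		is_negative;	/* cached "name not found"  */
	} entry_t;

	static int
	preferred_victim(const entry_t *e, int chain_len, int long_chain)
	{
		if (e->vnode_refcount == 1)	/* only the dnlc holds the vnode */
			return (1);
		if (e->is_negative)		/* negative cache entry */
			return (1);
		if (chain_len > long_chain)	/* unusually long hash chain */
			return (1);
		return (0);			/* otherwise evict from the chain tail */
	}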
Jürgen Keil
2006-Mar-23 11:15 UTC
[zfs-discuss] Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399
> Note, reducing only down to 3% (or whatever it will be) *only* happens
> when ZFS detects very low memory - as dictated by arc_reclaim_needed():

....

> #if defined(__i386)
> 	if (btop(vmem_size(heap_arena, VMEM_FREE)) <
> 	    (btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
> 		return (1);
> #endif
>
> Mark is currently refining the last case.

The problem is that on 32-bit x86 we can enter this ">75% heap_arena full"
condition, and zfs's arc cache basically prevents the kernel's kmem_cache
allocators from returning most of the unused free space to the heap arena,
so that we remain at >75% heap full all the time.

A side effect is that zfs tries to shrink its arc buffer cache more and
more, and at the same time reduces the dnlc cache by 3% over and over again.
In the end zfs is running with almost no buffer cache, the dnlc contains
only a couple of entries (<50), and the system is reading from the disk
like crazy.

This is bug 6397610 / 6398177. See also this thread:
http://www.opensolaris.org/jive/thread.jspa?threadID=6882&tstart=0

> Your "find" case should not be affected by this code path.

I'm not so sure here....

The best way (that I've found so far) to reproduce bugs 6397610 / 6398177
is a recursive diff over two sets of opensolaris sources

	gdiff -ru opensolaris-2006XXYY/usr opensolaris-2006XXZZ/usr

on a 32-bit x86 machine, followed by starting a "nightly -i opensolaris.sh"
incremental build. The recursive diff should be *very* similar to the
"find . -name '*.java' | xargs grep -i whereIsIt" case.

The diff adds lots of entries to the kernel's dnode_t, dmu_buf_impl_t and
zfs_znode_cache kmem_caches. Then the nightly build needs more space in
other kmem areas, the ">75% heap full" condition is triggered, and some
zio_bufs and arc_buf* are returned to the heap arena. As a side effect,
dnode_t, dmu_buf_impl_t and zfs_znode_cache entries are freed (entries are
intended to be reused in the same kmem_cache), but this free space is *not*
given back to the heap arena.

There are cases where zfs thinks the kernel's heap is full, but the
kmem_caches are sitting on > 60 mbytes of free memory.

This message posted from opensolaris.org
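One way to observe this kind of situation (assuming kernel debugger access
on the affected machine) is to look at the per-cache statistics, which show
for each kmem cache how much memory the cache holds versus how many buffers
are actually in use:

	# echo ::kmastat | mdb -k

The dnode_t, dmu_buf_impl_t and zfs_znode_cache lines make it visible when
those caches are sitting on tens of megabytes of freed-but-not-returned
memory while the heap arena looks full.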
Jonathan Adams
2006-Mar-23 16:31 UTC
[zfs-discuss] Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache(), bug 6399
On Thu, Mar 23, 2006 at 03:15:06AM -0800, Jürgen Keil wrote:

> The diff adds lots of entries to the kernel's dnode_t, dmu_buf_impl_t
> and zfs_znode_cache kmem_caches.
>
> Then the nightly build needs more space in other kmem areas, the
> ">75% heap full" condition is triggered, and some zio_bufs and arc_buf*
> are returned to the heap arena. As a side effect, dnode_t, dmu_buf_impl_t
> and zfs_znode_cache entries are freed (entries are intended to be reused
> in the same kmem_cache), but this free space is *not* given back to the
> heap arena.

The problem is that there are 9 dmu_buf_impl_ts per "slab", and 17 dnodes
per "slab"; a slab cannot be freed until *all* of its objects are free, and
not cached in magazines. I've heard this referred to as a slab being "held
hostage" by an outstanding allocation.

kmem_reap() flushes the magazines to the slabs (except for a 15-second
working set). Usually, this frees up memory, but there aren't any
guarantees.

It might make sense to invent a "desperate" reaping mode, which kicks in
only when the kmem_reap fails to free up any memory. But this is still not
guaranteed to free up more memory, so it might just slow things down more
when we're already under heavy load.

Cheers,
- jonathan

> There are cases where zfs thinks the kernel's heap is full, but the
> kmem_caches are sitting on > 60 mbytes of free memory.
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Jonathan Adams, Solaris Kernel Development
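A back-of-the-envelope illustration of the "held hostage" effect described
above (a toy model, not kernel code; only the 17-objects-per-slab figure is
taken from the post): if live objects are scattered uniformly across slabs,
a slab is reclaimable only when all of its objects are free, so even with
90% of the objects free, most slabs cannot be returned to the heap arena.

	#include <math.h>
	#include <stdio.h>

	int
	main(void)
	{
		const int objs_per_slab = 17;	/* e.g. dnode_t slabs, per the post */
		const double frac_free = 0.90;	/* 90% of objects are free */

		/* probability that every object in a given slab is free */
		double reclaimable = pow(frac_free, objs_per_slab);

		/* prints roughly 17%: ~83% of slabs stay pinned */
		printf("expected reclaimable slabs: %.1f%%\n", reclaimable * 100.0);
		return (0);
	}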