Jürgen Keil
2006-Mar-23 18:15 UTC
[zfs-discuss] Re: Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache()
> > The diff adds lots of entries to the kernel's dnode_t, dmu_buf_impl_t
> > and zfs_znode_cache kmem_caches.
> >
> > And starting the nightly build needs more space in other kmem areas,
> > the ">75% heap full" condition is triggered, some zio_bufs and
> > arc_buf* are returned to the heap arena. As a side effect dnode_t,
> > dmu_buf_impl_t and zfs_znode_cache entries are freed (entries are
> > intended to be reused in the same kmem_cache) but this free space is
> > *not* given back to the heap arena.
>
> The problem is that there are 9 dmu_buf_impl_ts per "slab", and 17
> dnode's per "slab"; a slab cannot be freed until *all* of its nodes
> are free, and not cached in magazines. I've heard this referred to as
> a slab being "held hostage" by an outstanding allocation.

But I've observed cases where mdb "::kmastat" reported a total of
1818 dmu_buf_impl_t in use, and 94290 dmu_buf_impl_t allocated.
The worst case would be that the 1818 entries in use pin 1818 slabs
(with one entry still in use and 8 entries already freed per slab).
So there must be at least (94290/9 - 1818) slabs that are completely
free and could be returned to the heap_arena.

Same for the dnode_t and zfs_znode_cache.

The problem with an unmodified snv_37 on x86 32bit is that kmem_reap()
might not be called at all, from

    usr/src/uts/common/vm/vm_page.c
    usr/src/uts/common/vm/page_retire.c
    usr/src/uts/common/vm/seg_kmem.c
    usr/src/uts/common/os/vm_pageout.c

on systems that have too *much* physical memory installed.

Btw. a workaround to get the kernel out of the "poor nightly build
performance" mode is to start a process that consumes all of the free
physical memory in the system: note the amount of free memory reported
by "vmstat 5", and run "dd if=/dev/zero of=/dev/null bs=${free_mem}k count=2"
(where ${free_mem} is slightly less than the free memory reported by vmstat).

The vm code (usr/src/uts/common/vm/vm_page.c, page_resv()?
page_mem_avail()?) will detect a memory shortage and call kmem_reap(),
which returns a significant amount of free memory from kmem caches to
the heap arena.

> kmem_reap() flushes the magazines to the slabs (except for a 15-second
> working set). Usually, this frees up memory, but there aren't any
> guarantees.

But arc_kmem_reap_now() already implements a "limited kmem_reap()" for
the zio_buf_cache[] caches, the buf_cache and the header cache, by
calling kmem_cache_reap_now():

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/arc.c#1224

1224 	for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
1225 		if (zio_buf_cache[i] != prev_cache) {
1226 			prev_cache = zio_buf_cache[i];
1227 			kmem_cache_reap_now(zio_buf_cache[i]);
1228 		}
1229 	}
1230 	kmem_cache_reap_now(buf_cache);
1231 	kmem_cache_reap_now(hdr_cache);

Why not extend this to *all* kmem_caches that are in use in the system?

> It might make sense to invent a "desperate" reaping mode, which kicks
> in only when the kmem_reap fails to free up any memory. But this is still
> not guaranteed to free up more memory, so it might just slow things down
> more when we're already under heavy load.
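For illustration only, the kind of extension asked about above might look
like the sketch below, living inside arc.c: the existing reap loop plus
explicit kmem_cache_reap_now() calls for the metadata caches discussed in
this thread. The extern cache names are assumptions (the real kmem_cache_t
pointers are private to dnode.c, dbuf.c and zfs_znode.c and may be named
differently), and reaping every cache in the system is essentially what a
full kmem_reap() already does; the open question in the thread is why that
reap is never triggered.

/*
 * Sketch only (not the actual implementation): arc_kmem_reap_now()
 * extended to also reap the ZFS metadata caches.  The extern names
 * below are assumptions for illustration.
 */
extern kmem_cache_t *dnode_cache;	/* assumed name */
extern kmem_cache_t *dbuf_cache;	/* assumed name */
extern kmem_cache_t *znode_cache;	/* assumed name */

static void
arc_kmem_reap_now_extended(void)
{
	size_t		i;
	kmem_cache_t	*prev_cache = NULL;

	/* existing behaviour: reap the zio buffer, arc buf and header caches */
	for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
		if (zio_buf_cache[i] != prev_cache) {
			prev_cache = zio_buf_cache[i];
			kmem_cache_reap_now(zio_buf_cache[i]);
		}
	}
	kmem_cache_reap_now(buf_cache);
	kmem_cache_reap_now(hdr_cache);

	/* additionally reap the caches whose slabs are "held hostage" */
	kmem_cache_reap_now(dnode_cache);
	kmem_cache_reap_now(dbuf_cache);
	kmem_cache_reap_now(znode_cache);
}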
Jonathan Adams
2006-Mar-23 18:59 UTC
[zfs-discuss] Re: Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache()
On Thu, Mar 23, 2006 at 10:15:55AM -0800, Jürgen Keil wrote:
> > > The diff adds lots of entries to the kernel's dnode_t, dmu_buf_impl_t
> > > and zfs_znode_cache kmem_caches.
> > >
> > > And starting the nightly build needs more space in other kmem areas,
> > > the ">75% heap full" condition is triggered, some zio_bufs and
> > > arc_buf* are returned to the heap arena. As a side effect dnode_t,
> > > dmu_buf_impl_t and zfs_znode_cache entries are freed (entries are
> > > intended to be reused in the same kmem_cache) but this free space is
> > > *not* given back to the heap arena.
> >
> > The problem is that there are 9 dmu_buf_impl_ts per "slab", and 17
> > dnode's per "slab"; a slab cannot be freed until *all* of its nodes
> > are free, and not cached in magazines. I've heard this referred to as
> > a slab being "held hostage" by an outstanding allocation.
>
> But I've observed cases where mdb "::kmastat" reported a total of
> 1818 dmu_buf_impl_t in use, and 94290 dmu_buf_impl_t allocated.
> The worst case would be that the 1818 entries in use pin 1818 slabs
> (with one entry still in use and 8 entries already freed per slab).
> So there must be at least (94290/9 - 1818) slabs that are completely
> free and could be returned to the heap_arena.
>
> Same for the dnode_t and zfs_znode_cache.

If you've got a dump, could you send the output of the following:

> ::kmastat ! grep dmu_buf
dmu_buf_impl_t            408  188753  188757  85905408   295964     0
> ::kmem_cache ! grep dmu_buf
ffffffff8ebd2008 dmu_buf_impl_t           020f 000000    408 188757
> ffffffff8ebd2008::print -d kmem_cache_t cache_full cache_empty
{
    cache_full.ml_list = 0
    cache_full.ml_total = 0
    cache_full.ml_min = 0
    cache_full.ml_reaplimit = 0
    cache_full.ml_alloc = 0t5046
}
{
    cache_empty.ml_list = 0xfffffe839786edc0
    cache_empty.ml_total = 0t1015
    cache_empty.ml_min = 0t1015
    cache_empty.ml_reaplimit = 0t1015
    cache_empty.ml_alloc = 0t5049
}
>

along with the same for dnode_t and zfs_znode_cache, and the output of:

> kmem_reaping::print

(MIN(ml_min, ml_reaplimit) is the number of magazines which will be freed
up by a reap, ml_total is the total number of magazines in the cache.
cache_full contains full magazines, which are the ones holding down buffers.)

If kmem_reaping is 0, then no reap has occurred in the past 15 seconds.

> The problem with an unmodified snv_37 on x86 32bit is that kmem_reap()
> might not be called at all, from
>
>     usr/src/uts/common/vm/vm_page.c
>     usr/src/uts/common/vm/page_retire.c
>     usr/src/uts/common/vm/seg_kmem.c
>     usr/src/uts/common/os/vm_pageout.c
>
> on systems that have too *much* physical memory installed.

Are you saying this is because the low kernel-VA problem isn't being
noticed? Or because some failing codepath is missing a call to kmem_reap()?

Cheers,
- jonathan

-- 
Jonathan Adams, Solaris Kernel Development
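As a reading aid for the ::print output above and in the follow-ups, the
arithmetic Jonathan describes can be sketched as below. The magazine size
(rounds per magazine) is per-cache and is not shown in the ::print output,
so it is an assumed parameter here, and the sample figures are only
illustrative (of the same order as the cache_full numbers reported later in
this thread).

#include <stdio.h>

/*
 * Sketch of the depot bookkeeping described above.  Full magazines in
 * cache_full hold buffers "hostage"; a kmem_reap() hands back at most
 * MIN(ml_min, ml_reaplimit) magazines.  mag_size (buffers per magazine)
 * is per-cache and not shown in the ::print output -- assumed here.
 */
int
main(void)
{
	unsigned long ml_total = 618;		/* illustrative cache_full values */
	unsigned long ml_min = 612;
	unsigned long ml_reaplimit = 615;
	unsigned long mag_size = 15;		/* assumed rounds per magazine */
	unsigned long reapable = (ml_min < ml_reaplimit ? ml_min : ml_reaplimit);

	printf("buffers held in full magazines: ~%lu\n", ml_total * mag_size);
	printf("buffers a reap could release:   ~%lu\n", reapable * mag_size);
	return (0);
}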
Jürgen Keil
2006-Mar-23 19:32 UTC
[zfs-discuss] Re: Re: Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache()
tiger2 111# /usr/bin/savecore -L
dumping to /dev/dsk/c1d0s1, offset 65536, content: kernel
100% done: 71986 pages dumped, compression ratio 2.91, dump succeeded
System dump time: Thu Mar 23 20:18:02 2006
Constructing namelist /var/crash/tiger2/unix.4
Constructing corefile /var/crash/tiger2/vmcore.4
100% done: 71986 of 71986 pages saved
tiger2 112# cd /var/crash/tiger2/
tiger2 113# mdb -k 4
Loading modules: [ unix krtld genunix specfs dtrace uppc pcplusmp ufs ip sctp
usba uhci s1394 nca zfs random fcp fctl lofs audiosup md cpc fcip crypto
logindmux ptm sppp nfs ipc ]
> ::kmastat ! grep dmu_buf
mdb: slab e21d8860 isn't in cache d06086f0 (in cache d060d6f0)
dmu_buf_impl_t            192    1605   69027  13463552  1068486     0
> ::kmem_cache ! grep dmu_buf
d061e030 dmu_buf_impl_t           0000 000000    192  69027
> d061e030::print -d kmem_cache_t cache_full cache_empty
{
    cache_full.ml_list = 0xe64ac500
    cache_full.ml_total = 0t618
    cache_full.ml_min = 0t612
    cache_full.ml_reaplimit = 0t615
    cache_full.ml_alloc = 0t77920
}
{
    cache_empty.ml_list = 0xe64ab100
    cache_empty.ml_total = 1
    cache_empty.ml_min = 0
    cache_empty.ml_reaplimit = 1
    cache_empty.ml_alloc = 0t84457
}
> ::kmastat ! grep dnode_t
mdb: slab e21d8860 isn't in cache d06086f0 (in cache d060d6f0)
dnode_t                   388    1500   32890  13471744   372320     0
> ::kmem_cache ! grep dnode_t
d061e6f0 dnode_t                  0000 000000    388  32890
> d061e6f0::print -d kmem_cache_t cache_full cache_empty
{
    cache_full.ml_list = 0xe483ce20
    cache_full.ml_total = 0t1083
    cache_full.ml_min = 0t1079
    cache_full.ml_reaplimit = 0t1081
    cache_full.ml_alloc = 0t85434
}
{
    cache_empty.ml_list = 0xe483da00
    cache_empty.ml_total = 1
    cache_empty.ml_min = 1
    cache_empty.ml_reaplimit = 1
    cache_empty.ml_alloc = 0t91537
}
> ::kmastat ! grep zfs_znode_cache
mdb: slab e21d8860 isn't in cache d06086f0 (in cache d060d6f0)
zfs_znode_cache           104    1281  118560  12451840   306331     0
> ::kmem_cache ! grep zfs_znode_cache
d06206f0 zfs_znode_cache          0000 000000    104 118560
> d06206f0::print -d kmem_cache_t cache_full cache_empty
{
    cache_full.ml_list = 0xda9ae020
    cache_full.ml_total = 0t16751
    cache_full.ml_min = 0t16747
    cache_full.ml_reaplimit = 0t16750
    cache_full.ml_alloc = 0t20763
}
{
    cache_empty.ml_list = 0xda9a4840
    cache_empty.ml_total = 1
    cache_empty.ml_min = 0
    cache_empty.ml_reaplimit = 0
    cache_empty.ml_alloc = 0t38543
}
> kmem_reaping::print
0
> dnlc_nentries::print -d
0t23                            <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> arc::print -d
{
    anon = ARC_anon
    mru = ARC_mru
    mru_ghost = ARC_mru_ghost
    mfu = ARC_mfu
    mfu_ghost = ARC_mfu_ghost
    size = 0t4038144            <<<<<<<<<<<<<<<<<<<<<<<<<<<
    p = 0t4495360
    c = 0t67108864
    c_min = 0t67108864
    c_max = 0t506456064
    hits = 0t8460648
    misses = 0t240886
    deleted = 0t208735
    skipped = 0t20
    hash_elements = 0t11151
    hash_elements_max = 0t26769
    hash_collisions = 0t179583
    hash_chains = 0t1502
    hash_chain_max = 7
    no_grow = 1
}
>

At the same time, during the last 10-15 minutes, I also had this dtrace
script running; it produced *no* output:

# more /rmdisk/flashdisk/dtrace/arc4.d
#!/usr/sbin/dtrace -s

fbt::kmem_reap:,
fbt::kmem_reap_idspace:
{
}

tiger2 112# /rmdisk/flashdisk/dtrace/arc4.d
dtrace: script '/rmdisk/flashdisk/dtrace/arc4.d' matched 4 probes

# vmstat 10
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd cd f0 s0   in   sy   cs us sy id
 0 0 0 5172376 1284480 319 4341 10 0 0 0 5 14 1 0 0 1155 6155 795 23 12 66
 0 0 0 3292648 1420728 1 5 0 0 0 0 0 3 0 0 0 891 817 818 4 1 95
 0 0 0 3292804 1426056 0 0 0 0 0 0 0 0 0 0 0 940 1106 924 5 1 94
^C
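Reading the ::kmastat columns above as buffer size, buffers in use, and
total buffers allocated to the cache (an interpretation, not something the
output itself states), a rough calculation of how much memory is sitting in
these three caches without being in use, while kmem_reaping is still 0,
looks like this:

#include <stdio.h>

/*
 * Rough estimate of memory allocated to the three caches above but not
 * in use, from the ::kmastat figures in this dump.  Slab and magazine
 * overhead is ignored, so this is only an approximation.
 */
struct cache_stat {
	const char	*name;
	unsigned long	buf_size;	/* bytes per buffer  */
	unsigned long	buf_inuse;	/* buffers in use    */
	unsigned long	buf_total;	/* buffers allocated */
};

int
main(void)
{
	struct cache_stat c[] = {
		{ "dmu_buf_impl_t",  192,  1605,  69027 },
		{ "dnode_t",         388,  1500,  32890 },
		{ "zfs_znode_cache", 104,  1281, 118560 },
	};
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < sizeof (c) / sizeof (c[0]); i++) {
		unsigned long idle = (c[i].buf_total - c[i].buf_inuse) *
		    c[i].buf_size;
		total += idle;
		printf("%-16s ~%lu KB idle\n", c[i].name, idle / 1024);
	}
	/* roughly 12.3 MB + 11.6 MB + 11.6 MB, about 35 MB in total */
	printf("total            ~%lu MB idle\n", total / (1024 * 1024));
	return (0);
}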
Jürgen Keil
2006-Mar-23 19:45 UTC
[zfs-discuss] Re: Re: Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache()
> > The problem with an unmodified snv_37 on x86 32bit is that kmem_reap()
> > might not be called at all, from
> >
> >     usr/src/uts/common/vm/vm_page.c
> >     usr/src/uts/common/vm/page_retire.c
> >     usr/src/uts/common/vm/seg_kmem.c
> >     usr/src/uts/common/os/vm_pageout.c
> >
> > on systems that have too *much* physical memory installed.
>
> Are you saying this is because the low kernel-VA problem isn't being
> noticed? Or because some failing codepath is missing a call to kmem_reap()?

Well, in usr/src/uts/common/vm/vm_page.c, function page_resv(),
kmem_reap() is called if (availrmem < tune.t_minarmem + npages):

> availrmem::print -d
0t448298
> tune::print -d t_minarmem
t_minarmem = 0t25

I have too much free memory, so kmem_reap() isn't called.

Or in page_mem_avail(),
    if (availrmem < swapfs_reserve + npages && --count)

> availrmem::print -d
0t448267
> swapfs_reserve::print -d
0t4079

But it also checks this:

int
page_mem_avail(pgcnt_t npages)
{
	ulong_t count;

#if defined(__i386)
	if (freemem > desfree + npages &&
	    availrmem > swapfs_reserve + npages &&
	    btop(vmem_size(heap_arena, VMEM_FREE)) > tune.t_minarmem +
	    npages)
		return (1);
#else

> freemem::print -d
0t352476
> desfree::print -d
0t4079
> availrmem::print -d
0t448327
> swapfs_reserve::print -d
0t4079
> tune::print -d t_minarmem
t_minarmem = 0t25
>

Note that it doesn't care about the heap unless free heap space is less
than 25*4KB = 100KB. zfs/arc is testing against >75% of used heap space
(25% of 512 MB, ~128 MB)?

usr/src/uts/common/vm/seg_kmem.c calls kmem_reap() when large pages are
used? I'm probably not using large pages.

usr/src/uts/common/os/vm_pageout.c: schedpaging() calls kmem_reap() when
(freemem < lotsfree + needfree + kmem_reapahead):

> freemem ::print -d
0t352773
> lotsfree ::print -d
0t8158
> needfree::print -d
0
> kmem_reapahead::print -d
0
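Plugging the observed values into the three conditions above makes the
point concrete: none of them comes close to firing, so kmem_reap() never
runs. The sketch below is illustrative only; npages and the free-heap
figure are assumptions (the real page_mem_avail() computes the heap term
from vmem_size(heap_arena, VMEM_FREE)), while everything else comes from
the mdb output above.

#include <stdio.h>

/*
 * Plug the mdb values shown above into the conditions under which the
 * VM code calls kmem_reap().  All quantities are in pages.  npages and
 * heap_free_pages are illustrative assumptions; the rest are the
 * ::print values from above.
 */
int
main(void)
{
	unsigned long availrmem = 448298, t_minarmem = 25;
	unsigned long freemem = 352476, desfree = 4079;
	unsigned long swapfs_reserve = 4079;
	unsigned long lotsfree = 8158, needfree = 0, kmem_reapahead = 0;
	unsigned long npages = 1;		/* assumed request size */
	unsigned long heap_free_pages = 25600;	/* assumed ~100 MB free heap VA */

	/* page_resv(): reaps only if availrmem < t_minarmem + npages */
	printf("page_resv() would reap:      %d\n",
	    availrmem < t_minarmem + npages);

	/* page_mem_avail(), i386 fast path: succeeds without ever reaping */
	printf("page_mem_avail() fast path:  %d\n",
	    freemem > desfree + npages &&
	    availrmem > swapfs_reserve + npages &&
	    heap_free_pages > t_minarmem + npages);

	/* schedpaging(): reaps only if freemem < lotsfree + needfree + reapahead */
	printf("schedpaging() would reap:    %d\n",
	    freemem < lotsfree + needfree + kmem_reapahead);

	return (0);
}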
Jürgen Keil
2006-Mar-23 20:19 UTC
[zfs-discuss] Re: Re: Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache()
~ an hour later, after starting a nightly build, there are no big changes:

> ::kmastat ! egrep 'dnode_t|zfs_znode_cache|dmu_buf_impl'
dmu_buf_impl_t            192    2036   69027  13463552  1493017     0
dnode_t                   388    1955   32890  13471744   550720     0
zfs_znode_cache           104    1583  118560  12451840   481819     0
> ::kmem_cache ! egrep 'dnode_t|zfs_znode_cache|dmu_buf_impl'
d061e030 dmu_buf_impl_t           0000 000000    192  69027
d061e6f0 dnode_t                  0000 000000    388  32890
d06206f0 zfs_znode_cache          0000 000000    104 118560
> d061e030::print -d kmem_cache_t cache_full cache_empty
{
    cache_full.ml_list = 0xe64a7c40
    cache_full.ml_total = 0t599
    cache_full.ml_min = 0t585
    cache_full.ml_reaplimit = 0t565
    cache_full.ml_alloc = 0t95201
}
{
    cache_empty.ml_list = 0xe64ace80
    cache_empty.ml_total = 0t20
    cache_empty.ml_min = 0t17
    cache_empty.ml_reaplimit = 0t19
    cache_empty.ml_alloc = 0t101719
}
> d061e6f0::print -d kmem_cache_t cache_full cache_empty
{
    cache_full.ml_list = 0xe483d140
    cache_full.ml_total = 0t1016
    cache_full.ml_min = 0t990
    cache_full.ml_reaplimit = 0t982
    cache_full.ml_alloc = 0t103087
}
{
    cache_empty.ml_list = 0xe48461e0
    cache_empty.ml_total = 0t68
    cache_empty.ml_min = 0t54
    cache_empty.ml_reaplimit = 0t58
    cache_empty.ml_alloc = 0t109123
}
> d06206f0::print -d kmem_cache_t cache_full cache_empty
{
    cache_full.ml_list = 0xecdbf580
    cache_full.ml_total = 0t16692
    cache_full.ml_min = 0t16683
    cache_full.ml_reaplimit = 0t16673
    cache_full.ml_alloc = 0t38205
}
{
    cache_empty.ml_list = 0xda9a50a0
    cache_empty.ml_total = 0t60
    cache_empty.ml_min = 0t29
    cache_empty.ml_reaplimit = 0t34
    cache_empty.ml_alloc = 0t55926
}
> kmem_reaping::print
0

Well, the dtrace script has printed this:

tiger2 112# /rmdisk/flashdisk/dtrace/arc4.d
dtrace: script '/rmdisk/flashdisk/dtrace/arc4.d' matched 4 probes
CPU     ID                    FUNCTION:NAME
  0  13952           kmem_reap_idspace:entry
  0  13953          kmem_reap_idspace:return

But still no kmem_reap() call.

vmstat performance data while compiling is horrible, note the idle column:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd cd f0 s0   in   sy   cs us sy id
 2 0 0 3177368 1329940 80 1343 0 0 0 0 0 86 0 0 0 2986 4203 7921 38 17 45
 0 0 0 3183468 1335836 24 409 0 0 0 0 0 90 0 0 0 3818 3076 10772 14 12 74
 0 0 0 3185744 1338072 44 689 0 0 0 0 0 84 0 0 0 4559 2712 11824 12 12 76
 0 0 0 3186336 1338620 23 442 0 0 0 0 0 89 0 0 0 3848 2758 12784 13 12 74
 0 0 0 3184832 1338664 57 883 0 0 0 0 0 95 0 0 0 3596 5758 11032 29 14 57
 2 0 0 3182396 1336872 26 477 0 0 0 0 0 90 0 0 0 4044 3956 10315 21 13 66
 0 0 0 3180804 1335520 20 302 0 0 0 0 0 89 0 0 0 4325 2262 11460 12 11 77
 0 0 0 3185996 1340152 125 1818 0 0 0 0 0 89 0 0 0 3672 3151 10929 13 14 72
 0 0 0 3181216 1335656 0 123 0 0 0 0 0 91 0 0 0 4097 1512 10723 8 9 82
 0 0 0 3180800 1336560 67 1070 0 0 0 0 0 89 0 0 0 3681 3367 11688 22 15 63
 0 0 0 3182056 1336252 9 176 0 0 0 0 0 99 0 0 0 3948 2565 11936 15 12 72
 0 0 0 3186884 1340984 45 787 0 0 0 0 0 87 0 0 0 4252 1798 10250 9 10 81
 0 0 0 3180516 1334976 34 548 0 0 0 0 0 98 0 0 0 4080 2002 10664 9 10 80
 0 0 0 3180716 1335328 32 433 0 0 0 0 0 90 0 0 0 4436 1450 10613 8 9 82
 0 0 0 3189224 1343052 39 577 0 0 0 0 0 84 0 0 0 4288 2767 11859 11 12 77
 1 0 0 3192568 1346768 63 1126 0 0 0 0 0 88 0 0 0 4225 2396 10767 9 12 78
Jonathan Adams
2006-Mar-23 20:54 UTC
[zfs-discuss] Re: Re: Re: zfs/arc: system seems to be spinning in dnlc_reduce_cache()
On Thu, Mar 23, 2006 at 11:45:43AM -0800, Jürgen Keil wrote:
> > > The problem with an unmodified snv_37 on x86 32bit is that kmem_reap()
> > > might not be called at all, from
> > >
> > >     usr/src/uts/common/vm/vm_page.c
> > >     usr/src/uts/common/vm/page_retire.c
> > >     usr/src/uts/common/vm/seg_kmem.c
> > >     usr/src/uts/common/os/vm_pageout.c
> > >
> > > on systems that have too *much* physical memory installed.
> >
> > Are you saying this is because the low kernel-VA problem isn't being
> > noticed? Or because some failing codepath is missing a call to kmem_reap()?
>
> Well, in usr/src/uts/common/vm/vm_page.c, function page_resv(),
> kmem_reap() is called if (availrmem < tune.t_minarmem + npages):
>
> > availrmem::print -d
> 0t448298
> > tune::print -d t_minarmem
> t_minarmem = 0t25
>
> I have too much free memory, so kmem_reap() isn't called.
>
> Or in page_mem_avail(),
>     if (availrmem < swapfs_reserve + npages && --count)
>
> > availrmem::print -d
> 0t448267
> > swapfs_reserve::print -d
> 0t4079
>
> But it also checks this:
>
> int
> page_mem_avail(pgcnt_t npages)
> {
> 	ulong_t count;
>
> #if defined(__i386)
> 	if (freemem > desfree + npages &&
> 	    availrmem > swapfs_reserve + npages &&
> 	    btop(vmem_size(heap_arena, VMEM_FREE)) > tune.t_minarmem +
> 	    npages)
> 		return (1);
> #else
>
> > freemem::print -d
> 0t352476
> > desfree::print -d
> 0t4079
> > availrmem::print -d
> 0t448327
> > swapfs_reserve::print -d
> 0t4079
> > tune::print -d t_minarmem
> t_minarmem = 0t25
>
> Note that it doesn't care about the heap unless free heap space is less
> than 25*4KB = 100KB. zfs/arc is testing against >75% of used heap space
> (25% of 512 MB, ~128 MB)?

Hrm.  If we ever *completely* run out of kernel VA, then uts/common/os/vmem.c
will call kmem_reap().  But the above code in page_mem_avail() is trying to
avoid ever reaching that condition.

The problem is that, instead of calling kmem_reap() when VA space is low,
it just fails the fork().  So I agree that we're missing calls to
kmem_reap() in the "low VA" case.

There are a couple different places the call could go.  schedpaging() or
page_mem_avail() seem like likely candidates.

Cheers,
- jonathan

-- 
Jonathan Adams, Solaris Kernel Development
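Purely as a sketch of what such a call site might look like: a periodic
check that starts a reap when free kernel heap VA drops below some limit.
The threshold tunable and the helper are invented for illustration, and
whether the check belongs in schedpaging(), page_mem_avail() or somewhere
else is exactly the open question above.

#include <sys/types.h>
#include <sys/param.h>
#include <sys/kmem.h>
#include <sys/vmem.h>
#include <vm/seg_kmem.h>

/*
 * Sketch only -- not an actual fix.  heap_reap_thresh is an invented
 * tunable: pages of free heap VA below which a reap is started.  A real
 * change would more likely hang off schedpaging() or page_mem_avail(),
 * as discussed above.
 */
static pgcnt_t heap_reap_thresh = 8192;		/* ~32 MB with 4 KB pages */

static void
maybe_reap_for_heap_va(void)
{
	pgcnt_t heap_free = btop(vmem_size(heap_arena, VMEM_FREE));

	if (heap_free < heap_reap_thresh)
		kmem_reap();	/* asynchronous; a no-op while a reap is in progress */
}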