Jürgen Keil
2006-Aug-07 10:08 UTC
[zfs-discuss] snv_46: hangs when using zvol swap and the system is low on free memory ?
I've tried to use "dmake lint" on on-src-20060731, and was running out of swap on my Tecra S1 laptop (32-bit x86, 768MB main memory, 512MB swap slice). The "FULL KERNEL: global crosschecks:" lint run consumes lots of space (~800MB) in /tmp, so the system was running out of swap space.

To fix this, I've tried to add a 512MB (compressed) zvol device as additional swap space.

Now the dmake lint hangs the OS, sooner or later.

In a crash dump, I found:

    > ::pgrep lint
    S    PID   PPID   PGID    SID    UID      FLAGS     ADDR NAME
    R  10806  10805    747    712    109 0x42004000 e2d158c8 lint
    R  10802  10801    747    712    109 0x42004000 e2d1b990 lint
    > e2d158c8::walk thread|::findstack -v
    stack pointer for thread e3806800: dc39adf8
      dc39ae2c swtch+0x168()
      dc39ae64 turnstile_block+0x6a5(da709288, 0, e39d1110, fec045e0, 0, 0)
      dc39aea4 rw_enter_sleep+0x13b(e39d1110, 0)
      dc39aec8 tmp_write+0x2d(e9fc39c0, dc39af3c, 0, daaa3978, 0)
      dc39af04 fop_write+0x2e(e9fc39c0, dc39af3c, 0, daaa3978, 0)
      dc39af84 write+0x2ac()
      dc39afac sys_sysenter+0x104()
    > e2d1b990::walk thread|::findstack -v
    stack pointer for thread e0f11400: d32c47b8
      d32c47e4 swtch+0x168()
      d32c47f4 cv_wait+0x4e(fec1ef42, fec1cf20)
      d32c4820 page_create_throttle+0x123(20, 3)
      d32c488c page_create_va+0x9f(fec20990, da8d1000, 0, 20000, 3, d32c48b4)
      d32c48ec segkmem_page_create+0x67(da8d1000, 20000, 0, 0)
      d32c4924 segkmem_xalloc+0xa3(da00f690, 0, 20000, 0, 0, fe840f08)
      d32c4950 segkmem_alloc+0xa0(da00f690, 20000, 0)
      d32c49ec vmem_xalloc+0x405(da010000, 20000, 1000, 0, 0, 0)
      d32c4a3c vmem_alloc+0x126(da010000, 20000, 0)
      d32c4a94 kmem_slab_create+0x6e(d3e21030, 0)
      d32c4ac0 kmem_slab_alloc+0x59(d3e21030, 0)
      d32c4af0 kmem_cache_alloc+0x119(d3e21030, 0)
      d32c4b04 zio_buf_alloc+0x1b(20000)
      d32c4b40 arc_read+0x332(0, da1f6740, e31f5100, f9a1c2d0, 0, 0)
      d32c4bbc dbuf_prefetch+0x124(d64e1640, 22, 0)
      d32c4bf4 dmu_zfetch_fetch+0x48(d64e1640, 20, 0, 6, 0)
      d32c4c54 dmu_zfetch_dofetch+0x183(d64e179c, ee0ecdb0)
      d32c4ca0 dmu_zfetch_find+0x530(d64e179c, d32c4cc8, 20)
      d32c4d24 dmu_zfetch+0xbf(d64e179c, 180000, 0, 20000, 0, 20)
      d32c4d5c dbuf_read+0xc9(d89ce4a0, e7d6e500, 32)
      d32c4db4 dmu_buf_hold_array_by_dnode+0x1fe(d64e1640, 180000, 0, 20000, 0, 1)
      d32c4de4 dmu_buf_hold_array_by_bonus+0x2a(e28686d8, 180000, 0, 20000, 0, 1)
      d32c4e68 zfs_read+0x17e(ddd27900, d32c4f3c, 0, daaa3978, 0)
      d32c4ea4 fop_read+0x2e(ddd27900, d32c4f3c, 0, daaa3978, 0)
      d32c4f84 read+0x2a1()
      d32c4fac sys_sysenter+0x104()
    > freemem/D
    freemem:
    freemem:        0

arc_read() needs a new buffer and tries to allocate kernel memory with KM_SLEEP. But there is no more free memory, so the allocation sleeps until resources become available. It seems that arc_read() is trying to restore a buffer from the arc ghost cache, and has the arc_buf_hdr_t locked while trying to allocate memory.

At the same time, the pageout daemon seems to be stuck in the zfs code, like this:

    > ::pgrep pageout|::walk thread|::findstack -v
    stack pointer for thread d386dc00: d38988a8
      d38988dc swtch+0x168()
      d3898914 turnstile_block+0x6a5(d3da3e90, 0, d3dce0cc, fec03b38, 0, 0)
      d3898974 mutex_vector_enter+0x2dc(d3dce0cc)
      d38989b4 buf_hash_find+0x4d(da1f6740, e5476900, bb305, 0, d38989fc)
      d3898a00 arc_read+0x24(0, da1f6740, e5476900, f9a1c2d0, 0, 0)
      d3898a7c dbuf_prefetch+0x124(e0c9e318, 3832, 0)
      d3898ab4 dmu_zfetch_fetch+0x48(e0c9e318, 3832, 0, 1, 0)
      d3898b14 dmu_zfetch_dofetch+0x183(e0c9e474, d8b46c60)
      d3898b60 dmu_zfetch_find+0x530(e0c9e474, d3898b88, 20)
      d3898be4 dmu_zfetch+0xbf(e0c9e474, 6972000, 0, 2000, 0, 20)
      d3898c10 dbuf_read+0xc9(df5679d8, df43eb00, 22)
      d3898c34 dmu_tx_check_ioerr+0x49(df43eb00, e0c9e318, 0, 34b9, 0)
      d3898c94 dmu_tx_count_write+0x114(d3875c98, 6965000, 0, e000, 0)
      d3898cdc dmu_tx_hold_write+0x52(eef5d5a8, 1, 0, 6965000, 0, e000)
      d3898d5c zvol_strategy+0x184(d972e1e8)
      d3898d78 bdev_strategy+0x4d(d972e1e8)
      d3898d94 spec_startio+0x6e(d8b85240, fde36240, 6965000, 0, e000, 8500)
      d3898dc0 spec_pageio+0x2a(d8b85240, fde36240, 6965000, 0, e000, 8500)
      d3898e0c fop_pageio+0x2d(d8b85240, fde36240, 6965000, 0, e000, 8500)
      d3898e80 tmp_putapage+0x177(e9fc39c0, fdbd63b0, d3898eb8, d3898ef0, 8400, da710e68)
      d3898ef4 tmp_putpage+0x1c6(e9fc39c0, 11c6b000, 0, 1000, 8400, da710e68)
      d3898f3c fop_putpage+0x27(e9fc39c0, 11c6b000, 0, 1000, 8400, da710e68)
      d3898f94 pageout+0x205(0, 0)
      d3898fa4 thread_start+8()

It seems the problem is that arc_read() has part of the buf hash table locked, then goes to sleep inside some kmem_*alloc(...KM_SLEEP) call. When the pageout daemon pushes a /tmp page out to the zvol swap device, and the zvol write path needs an arc buffer that happens to use the same hash chain that is locked by the previous arc_read() call, the system is stuck and I have to power cycle it.

I made more tests with uncompressed zvol devices, too, but the problem basically remains the same. The pageout daemon becomes stuck, some kernel zfs threads are waiting for free kernel memory while holding part of the arc buffer hash table locked, and other zfs threads are waiting for arc buffer hash table locks. There is no more progress; the system must be power cycled.

Shouldn't zfs be a bit more careful with KM_SLEEP allocations and locks?

This message posted from opensolaris.org
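The hang described above is a cycle in the "who waits on whom" relation between the threads in the dump. The following C sketch models it as a tiny wait-for graph and checks for a cycle. It is an illustration, not ZFS code, and it rests on two assumptions: that the lint thread e0f11400 holds the hash-chain mutex d3dce0cc that pageout is blocked on (as the post suggests, but the dump does not prove), and that free memory is only replenished by pageout pushing pages to swap. The enum, `build_graph()`, `on_cycle()` and `print_cycle()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Nodes of a hypothetical wait-for graph built from the two stacks above. */
enum node {
    LINT_READER,  /* thread e0f11400: zfs_read -> arc_read -> kmem_cache_alloc */
    FREE_MEMORY,  /* freemem, which was 0 in the dump */
    PAGEOUT,      /* thread d386dc00: pageout -> zvol_strategy -> buf_hash_find */
    HASH_LOCK,    /* hash chain mutex d3dce0cc (assumed held by LINT_READER) */
    NNODES
};

static const char *const node_names[NNODES] = {
    "lint reader (e0f11400)", "free memory",
    "pageout (d386dc00)", "hash chain mutex (d3dce0cc)",
};

/* waits_on[a] == b: a cannot make progress until b does; -1: no edge. */
static int waits_on[NNODES];

void build_graph(void)
{
    waits_on[LINT_READER] = FREE_MEMORY; /* KM_SLEEP allocation sleeps */
    waits_on[FREE_MEMORY] = PAGEOUT;     /* assumed: only pageout frees pages */
    waits_on[PAGEOUT]     = HASH_LOCK;   /* blocked in mutex_vector_enter */
    waits_on[HASH_LOCK]   = LINT_READER; /* released only when arc_read returns */
}

/* Each node has at most one outgoing edge, so following edges for at most
 * NNODES steps is enough to tell whether the walk comes back to start. */
bool on_cycle(int start)
{
    assert(start >= 0 && start < NNODES);
    int n = waits_on[start];
    for (int steps = 0; steps < NNODES && n != -1; steps++) {
        if (n == start)
            return true;
        n = waits_on[n];
    }
    return false;
}

/* Print the cycle through start; call only when on_cycle(start) is true. */
void print_cycle(int start)
{
    int n = start;
    do {
        printf("%s\n  waits on ", node_names[n]);
        n = waits_on[n];
    } while (n != start);
    printf("%s\n", node_names[start]);
}
```

With the graph as built, `on_cycle(PAGEOUT)` is true, and `print_cycle(PAGEOUT)` walks pageout, the hash chain mutex, the lint reader and free memory before returning to pageout. Setting `waits_on[HASH_LOCK] = -1`, which models not holding the hash-chain lock across the sleeping allocation, breaks the cycle for every node.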
Mark Maybee
2006-Aug-07 13:14 UTC
[zfs-discuss] snv_46: hangs when using zvol swap and the system is low on free memory ?
Jürgen Keil wrote:
> Shouldn't zfs be a bit more careful with KM_SLEEP allocations and locks?

Absolutely. I'll file a bug on this.

-Mark
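One common way to be "more careful" is to never sleep in the allocator while holding a lock that the memory-freeing path may need: try a non-blocking allocation under the lock, and if that fails, drop the lock, do the blocking allocation, re-take the lock and re-check the state. The sketch below is a user-space analogue of that pattern, not the change that was actually made in ZFS. A pthread mutex stands in for the arc hash-chain lock, `malloc()` stands in for `kmem_alloc()`, and `struct bucket`, `alloc_nosleep()`, `alloc_sleep()`, `bucket_get()` and the two counters are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for one hash chain: a lock and a cached buffer. */
struct bucket {
    pthread_mutex_t lock;
    void *buf;              /* NULL until the entry has been filled */
};

/* Demo knob: how many non-blocking allocations are allowed to succeed. */
static int nosleep_budget = 0;
/* Counts blocking allocations made while the bucket lock was held. */
static int sleeps_under_lock = 0;

/* Analogue of kmem_alloc(..., KM_NOSLEEP): may fail instead of waiting. */
static void *alloc_nosleep(size_t size)
{
    if (nosleep_budget <= 0)
        return NULL;
    nosleep_budget--;
    return malloc(size);
}

/* Analogue of kmem_alloc(..., KM_SLEEP): may wait, never returns NULL.
 * For the single-threaded demo it also records whether the bucket lock
 * was held at the moment of the (potentially sleeping) allocation. */
static void *alloc_sleep(struct bucket *b, size_t size)
{
    if (pthread_mutex_trylock(&b->lock) == 0)
        pthread_mutex_unlock(&b->lock);
    else
        sleeps_under_lock++;

    void *p = malloc(size);
    if (p == NULL)
        abort();
    return p;
}

/* Return the bucket's buffer, allocating it if needed, without ever
 * sleeping in the allocator while b->lock is held. */
void *bucket_get(struct bucket *b, size_t size)
{
    assert(b != NULL && size > 0);
    pthread_mutex_lock(&b->lock);
    if (b->buf == NULL) {
        void *p = alloc_nosleep(size);
        if (p == NULL) {
            /* Drop the lock so other threads (pageout, in the post)
             * can use this hash chain while we wait for memory. */
            pthread_mutex_unlock(&b->lock);
            p = alloc_sleep(b, size);
            pthread_mutex_lock(&b->lock);
            /* Re-check: the entry may have been filled while unlocked. */
            if (b->buf != NULL) {
                free(p);
                p = NULL;
            }
        }
        if (p != NULL)
            b->buf = p;
    }
    void *result = b->buf;
    pthread_mutex_unlock(&b->lock);
    return result;
}
```

The re-check after re-taking the lock is the price of this ordering: another thread may have filled the entry while the lock was dropped, so the loser frees its own allocation. This removes the hash-lock edge from the deadlock cycle; it does not by itself guarantee that pageout's own allocations in the zvol write path can make progress.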