hi, a while ago i posted about a problem that occurred, apparently because of ZFS: http://www.opensolaris.org/jive/thread.jspa?threadID=3704

after this one instance, the problem did not recur, so i assumed it was a fluke of some kind. a few days ago, i copied around eight several-MB files from one filesystem to another. upon realising that i had misnamed one, i attempted to "mv" it. however, the mv did not complete, and vmstat showed the same problem as before.

this time, i saved a crash dump (with savecore -L) that i can supply if it's useful, although since it would take rather a while to upload, i'm also happy to run mdb commands on it.

between the two occurrences, i moved the two disks in question to a different system, so i don't think it's a hardware issue, although it could be an oddity of the disks...

k.
James C. McPherson
2005-Dec-28 22:00 UTC
[zfs-discuss] high system cpu use/lock contention (redux)
kate wrote:
...
> this time, i saved a crash dump (with savecore -L) that i can supply if it's
> useful, although since it would take rather a while to upload, i'm also
> happy to run mdb commands on it.

Hi Kate,

for starters, if you could run

::status
::msgbuf
*panic_thread::findstack -v

If you can compress the crashdump and make it available somewhere, then
I/somebody can pull it in and analyse it further.

thanks in advance,
James C. McPherson
--
Solaris Datapath Engineering
Data Management Group
Sun Microsystems
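For reference, a minimal sketch of how those dcmds get run, assuming savecore left unix.9 and vmcore.9 in the current directory (the .9 suffix is just the dump number on this particular system):

$ mdb unix.9 vmcore.9
> ::status
> ::msgbuf
> *panic_thread::findstack -v
> $q

::status reports which dump was loaded and how it was taken, ::msgbuf replays the kernel message buffer, and $q exits mdb.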
hi james :-)

James C. McPherson:
JC> for starters, if you could run
JC> ::status

debugging crash dump ./vmcore.9 (32-bit) from hyacinth.local
operating system: 5.11 brandz_release (i86pc)
panic message:
dump content: kernel pages only

JC> ::msgbuf
MESSAGE
WARNING: gld_start: gldm_send failed -1
WARNING: rtls0: send fail--CHIP ERROR!
<similar messages elided>
pseudo-device: devinfo0
devinfo0 is /pseudo/devinfo@0
pseudo-device: devinfo0
devinfo0 is /pseudo/devinfo@0
dump on /dev/dsk/c2d0s1 size 1589 MB
dumping to /dev/dsk/c2d0s1, offset 65536, content: kernel

JC> *panic_thread::findstack -v
mdb: couldn't read thread at 0

i assume the last is because it's from a running system rather than an
actual crash dump.

JC> If you can compress the crashdump and make it available somewhere, then
JC> I/somebody can pull it in and analyse it further.

(sent privately...)

k.
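That assumption is right: savecore -L captures a dump from the live, running system, so no panic has occurred and the panic_thread pointer is NULL, which is why mdb complains it "couldn't read thread at 0". On a live dump the interesting threads have to be found by hand; one sketch, using the stock ::threadlist dcmd (with -v it prints a stack for every kernel thread, which can then be filtered down to the stuck process):

> ::threadlist -v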
James C. McPherson
2005-Dec-29 07:43 UTC
[zfs-discuss] high system cpu use/lock contention (redux)
kate wrote:
> debugging crash dump ./vmcore.9 (32-bit) from hyacinth.local
> operating system: 5.11 brandz_release (i86pc)
> panic message:
> dump content: kernel pages only
>
> JC> ::msgbuf
> MESSAGE
> WARNING: gld_start: gldm_send failed -1
> WARNING: rtls0: send fail--CHIP ERROR!
> <similar messages elided>
> pseudo-device: devinfo0
> devinfo0 is /pseudo/devinfo@0
> pseudo-device: devinfo0
> devinfo0 is /pseudo/devinfo@0
> dump on /dev/dsk/c2d0s1 size 1589 MB
> dumping to /dev/dsk/c2d0s1, offset 65536, content: kernel
>
> JC> *panic_thread::findstack -v
> mdb: couldn't read thread at 0
>
> i assume the last is because it's from a running system rather than an
> actual crash dump.

doh! I should have read your email more carefully.

[from the crash dump]

SolarisCAT(vmcore.9/11x)> proc -L 232339
addr       PID    PPID   RUID/UID   size       rss      swresv   time   command
---------- ------ ------ ---------- ---------- -------- -------- ------ ---------
0xcd376208 232339 1      0          1105920    0        98304    18216  mv 02 - Melancholy (Holy Martyr).mp3 05_melancholy.mp3

==== user (LWP_SYS) thread: 0xcefafa00 PID: 232339 ====
cmd: mv 02 - Melancholy (Holy Martyr).mp3 05_melancholy.mp3
t_procp: 0xcd376208 p_as: 0xce398da8 hat: 0xce4f8e58
size: 1105920 rss: 0
zone: global
t_stk: 0xca1d7fa4 sp: 0xca1d7d68 t_stkbase: 0xca1d6000
t_pri: 10(IA) pctcpu: 47.431519
t_lwp: 0xcc677200 lwp_regs: 0xca1d7fa4
psrset: 0 last CPU: 0
idle: 105 ticks (1.05 seconds)
start: Fri Dec 23 01:38:34 2005
age: 443 seconds (7 minutes 23 seconds)
stime: 15353531 (1.04 seconds earlier)
syscall: #134 rename(, 0x0) (sysent: address translation failed for sysent: 16 bytes @ 0x860)
tstate: TS_RUN - runnable, but not yet on a processor
tflg:   T_DFLTSTK - stack is default size
tpflg:  TP_TWAIT - wait to be freed by lwp_wait
        TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SKILLED - SIGKILL has been posted to the process
        SMSACCT - process is keeping micro-state accounting
        SMSFORK - child inherits micro-state accounting
pc: 0xfe832637 unix:0xfffffffffe832637 unix:_resume_from_idle+0x96 resume_return()
genunix:cv_wait+0x5d()
0xfa0a5943()
0xfa0c8c59()
genunix:fop_rename+0x1e()
genunix:vn_renameat+0x1e2()
genunix:vn_rename+0x18()
genunix:rename+0x13()
unix:sys_sysenter+0x1a3 _sysenter_done()
-- switch to user thread's user stack --

[Since SolarisCAT for x86/x64 isn't quite working right just yet, we'll
jump back and forth between it and mdb]

> 0t232339::pid2proc|::print proc_t p_tlist|::findstack -v
stack pointer for thread cefafa00: ca1d7d68
ca1d7d94 swtch+0x1f6()
ca1d7da0 cv_wait+0x5d(c87bdaee, c87bdaac)
ca1d7ddc txg_wait_open+0xe9(c87bda00, 1b1b05fe, 0)
ca1d7e44 zfs_rename+0x36a()
ca1d7e64 fop_rename+0x1e(c9430180, c86bad80, c9430180, d5992000, cd021cf8)
ca1d7f34 vn_renameat+0x1e2()
ca1d7f5c vn_rename+0x18(80479ef, 8047490, 0)
ca1d7f7c rename+0x13()
ca1d7fa4 sys_sysenter+0x1a3()

Looking through the zfs-related threads which have a significant
percentage of cpu accumulated:

SolarisCAT(vmcore.9/11x)> tlist findmodule zfs|egrep "wchan|pctcpu"
[elided] ...
t_wchan: 0xc8e413d4 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.175514 t_lwp: 0x0
t_wchan: 0xc9cccd14 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.319267 t_lwp: 0x0
t_wchan: 0xc9ccca14 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.333702 t_lwp: 0x0
t_wchan: 0xc9ccc714 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.342072 t_lwp: 0x0
t_wchan: 0xc9ccc414 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.349762 t_lwp: 0x0
t_wchan: 0xc9ccc114 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.353722 t_lwp: 0x0
t_wchan: 0xc9cbad54 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 97.356201 t_lwp: 0x0
t_wchan: 0xcd8ddd14 sobj: condition var (from zfs:zfs_delete_thread+0x160)
t_pri: 60(SYS) pctcpu: 100.000000 t_lwp: 0x0

this last thread in particular is of concern due to its immense cputime:

> 0xca1adde0::findstack -v
stack pointer for thread ca1adde0: ca1add48
ca1add74 swtch+0x1f6()
ca1add84 cv_wait+0x5d(cd8ddd14, cd8ddd0c)
ca1addc8 zfs_delete_thread+0x160(cd8ddcc0, 0)
ca1addd8 thread_start+8()

The memory usage in this system does not seem to be excessive, except for
one particular zfs-related cache:

                             buf     buf     buf    memory    #allocations
cache name                  size   avail   total    in use   succeed  fail
------------------------- ------ ------- ------- --------- ---------- ----
zio_buf_65536              65536       2     324  21233664      47691    0

So what I reckon has happened is that we have (a) run out of the
appropriately sized zio_buf* cache entries, and (b) we have a deadlock
between the relevant zfs_delete_thread worker threads as a result.

Code-wise we are in usr/src/uts/common/fs/zfs/txg.c:

382 void
383 txg_wait_open(dsl_pool_t *dp, uint64_t txg)
384 {
385         tx_state_t *tx = &dp->dp_tx;
386
387         mutex_enter(&tx->tx_sync_lock);
388         ASSERT(tx->tx_threads == 3);
389         if (txg == 0)
390                 txg = tx->tx_open_txg + 1;
391         if (tx->tx_quiesce_txg_waiting < txg)
392                 tx->tx_quiesce_txg_waiting = txg;
393         dprintf("txg=%llu quiesce_txg=%llu sync_txg=%llu\n",
394             txg, tx->tx_quiesce_txg_waiting, tx->tx_sync_txg_waiting);
395         while (tx->tx_open_txg < txg) {
396                 cv_broadcast(&tx->tx_quiesce_more_cv);
397                 cv_wait(&tx->tx_quiesce_done_cv, &tx->tx_sync_lock);
398         }
399         mutex_exit(&tx->tx_sync_lock);
400 }

blocked at line 397.

I'm logging this as a new bug - as far as I can determine the others that
were similar have all been closed as resolved quite some time ago.

best regards,
James C. McPherson
--
Solaris Datapath Engineering
Data Management Group
Sun Microsystems
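For anyone chasing a similar hang without SolarisCAT, the cache numbers above can also be pulled straight from the dump in mdb; a sketch (::kmastat is the stock kernel memory allocator report, and ! is mdb's shell escape, used here only to filter the output down to the zio_buf caches):

> ::kmastat ! grep zio_buf

Its columns correspond closely to the table above: buffer size, buffers in use and total, memory in use, and allocation successes and failures per cache.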